Series Editor's Preface


Approach your problems from the right end and begin with the answers.
Then one day, perhaps you will find the final question.

‘The Hermit Clad in Crane Feathers’ in R. van Gulik’s The Chinese Maze
Murders.

It isn't that they can't see the solution. It is that they can't see
the problem.

G.K. Chesterton. The Scandal of Father Brown ‘The point of a Pin’.


Growing specialization and diversification have brought a host of mono- 
graphs and textbooks on increasingly specialized topics. However, the 
"tree" of knowledge of mathematics and related fields does not grow 
only by putting forth new branches. It also happens, quite often in fact, 
that branches which were thought to be completely disparate are suddenly 
seen to be related. 

Further, the kind and level of sophistication of mathematics applied in 
various sciences has changed drastically in recent years: measure theory 
is used (nontrivially) in regional and theoretical economics; algebraic 
geometry interacts with physics; the Minkowski lemma, coding theory 
and the structure of water meet one another in packing and covering 
theory; quantum fields, crystal defects and mathematical programming 
profit from homotopy theory; Lie algebras are relevant to filtering; 
and prediction and electrical engineering can use Stein spaces. And in 
addition to this there are such new emerging subdisciplines as
“experimental mathematics”, “CFD”, “completely integrable systems”, “chaos,
synergetics and large-scale order", which are almost impossible to fit 
into the existing classification schemes. They draw upon widely different 
sections of mathematics. This programme, Mathematics and Its Appli- 
cations, is devoted to new emerging (sub) disciplines and to such (new) 
interrelations as exempla gratia: 

- a central concept which plays an important role in several different
mathematical and/or scientific specialized areas;

- new applications of the results and ideas from one area of scientific
endeavour into another;

- influences which the results, problems and concepts of one field of
enquiry have and have had on the development of another.

The Mathematics and Its Applications programme tries to make available
a careful selection of books which fit the philosophy outlined above. With
such books, which are stimulating rather than definitive, intriguing rather
than encyclopaedic, we hope to contribute something towards better
communication among the practitioners in diversified fields.

Because of the wealth of scholarly research being undertaken in the 
Soviet Union, Eastern Europe, and Japan, it was decided to devote spe- 
cial attention to the work emanating from these particular regions. Thus 
it was decided to start three regional series under the umbrella of the 
main MIA programme. 

Progress in mathematics, as in other sciences, thrives on the kind of 
questions and/or results which, so to speak, require one to twist one's 
mind upside down and out of shape a couple times in order to see that 
the solutions are both natural and beautiful. Paradoxes, i.e. counterintui- 
tive but true results, are perhaps the purest manifestations of such prob- 
lems. And probability theory, the science of random events, has always 
been and still is particularly rich in paradoxes. 

Studying and understanding a field through paradoxes is probably 
one of the better ways of gaining real intuition. For probability theory 
this would be an ideal book to do so. 


As long as algebra and geometry proceeded along separate paths, their
advance was slow and their applications limited.
But when these sciences joined company they drew from each other fresh
vitality and thenceforward marched on at a rapid pace towards perfection.

Joseph Louis Lagrange.


The unreasonable effectiveness of mathematics in science ...

Eugene Wigner


Well, if you know of a better 'ole, go to it.

Bruce Bairnsfather


What is now proved was once only imagined.

William Blake


Bussum, March 1986

Michiel Hazewinkel


Introduction 


“The fairest thing we can experience is 
the mysterious. It is the fundamental 
emotion which stands at the cradle of 
true art and science. He who knows it 
not and can no longer wonder, no long- 
er feel amazement, is as good as dead, 
a snuffed-out candle." 


Albert Einstein; Mein Weltbild, 1934
Engl. translation: Ideas and Opinions, by S. Bargmann


“It is remarkable that a science which
began with the considerations of games
of chance should have become the most
important object of human knowledge...”


Pierre Simon, Marquis de Laplace;
Théorie Analytique des Probabilités,
1812


Just like any other branch of science, mathematics also describes the 
contrasts of the world we live in. It is natural therefore that the history 
of mathematics has revealed many interesting paradoxes some of which 
have served as starting-points for great changes. The mathematics of 
randomness is especially rich in paradoxes. According to Charles Sanders 
Peirce no branch of mathematics is as easy to slip up in as probability 
theory. This book aims to show how this rapidly progressing and widely 
used branch of knowledge has developed from paradoxes. It tries to show 
those exciting moments that preceded or followed the solution of some 
outstanding paradoxical problems which are rarely mentioned in monographs.
The book does not confine itself to interesting but not very important
“gems” of probability theory, far from the main stream of development;
on the contrary, it emphasizes the contradictions that have done the
most to clear up fundamental crises in the mathematics of randomness.
The book also deals with problems that were not originally regarded as 
paradoxes. A book on paradoxes must naturally have a historical
framework and so this book begins with the oldest paradoxes of
probability theory.

It is important to distinguish paradoxes from fallacies. A paradox
is a true though surprising theorem, while a fallacy is a false
result obtained by reasoning that seems correct. Both paradoxes and
fallacies are very interesting and instructive but this book deals mainly 
with paradoxes (exceptions are, e.g., the “paradoxes” of IV/1). In for- 
mulating the paradoxes my aim was that each paradox should be clear 
by itself. It is obvious though that the reader who does not play bridge, 
or is not familiar with normal distributions, will have more difficulty 
when these particular notions are the basis of the paradoxes. However, 
by reading the book straight through from the beginning, he will discover 
the definitions of the most important notions. (The rules of bridge are 
not discussed in the book but those which are necessary to understand 
the paradox can also be found out.) 

The book consists of four main chapters. Each paradox will be dis- 
cussed in five parts: the history, formulation, explanation of the paradox, 
remarks, and, finally, references. Each chapter finishes with quickies. 
These are not discussed in detail, not because they are of less importance 
or interest, but because they do not fit into the main line of the book. 

The initial inspiration for a book on the paradoxes of probability
theory came from my late professor, Alfréd Rényi. A. N. Kolmogorov
also encouraged me when we met in Budapest in 1972. In 1976, I spent
a semester at the University of Amsterdam, where Professor A. A. Balkema
drew my attention to several interesting paradoxes. Further inspiration
came from the discussions following my lectures at Johns Hopkins,
Columbia and Yale Universities and at MIT. I was also fortunate to have
the opportunity to meet and discuss probabilistic problems with George
Pólya at Stanford University and in Budapest. I would like to thank
him for his advice. Special thanks must be given to several colleagues
of mine at the L. Eötvös University and the Mathematical Institute of
the Hungarian Academy of Sciences. Their names will occur frequently.

Finally, it should be emphasized that this English edition of the book 
is a revised and updated version of the Hungarian one. 


Chapter 1 


Classical paradoxes of probability theory 


“A classic is something that everybody wants to have read and nobody
wants to read.”

Mark Twain


“Experience is the name everybody gives to his mistakes.”

Oscar Wilde


“...the true logic for this world is the calculus of Probabilities,
which takes account of the magnitude of the probability which is, or
ought to be, in a reasonable man's mind.”

J. Clerk Maxwell


Considerations on probabilities (such as the old golden rules of gamblers) 
can be traced back to ancient times but mathematical calculations on 
probabilities and probabilistic paradoxes have been put in writing only
since the beginning of modern times. Though probability theory today 
has about as much to do with games of chance as geometry has to do 
with land surveying, the first paradoxes nevertheless arose from popular 
games of chance. 


1. THE PARADOX OF DICE. “GAMES OF CHANCE" 
IN THE WORLD OF PARTICLES 


a) The history of the paradox 


Dice was the most popular game of chance up until the end of the Middle
Ages. The word hazard refers to dice as well, for it comes from the
Arabic “al-zar” meaning “the dice”. Card games became popular in
Europe only in the 14th century, while dice had already been in fashion
in ancient Egypt during the 1st dynasty and later in Greece, as well as
in the Roman Empire. (According to Greek tradition it was Palamedes
who invented dice in order to entertain the bored Greek soldiers waiting
for the battle of Troy. A 2nd century writer, Pausanias, mentions a
picture painted by Polygnotos in the 5th century BC which showed Palamedes
and Thersites playing dice.) The earliest book on probability theory is a
book by Gerolamo Cardano (1501-1576) called “De Ludo Aleae”,
which is devoted mostly to dice. This short book was published only
in 1663, about 100 years after it had been written. This may be
the reason why Galileo began to deal with the same dice problem,
although it had already been solved in Cardano's work. Galileo
wrote a study on this theme sometime between 1613 and 1624. Its original
title was “Sopra le Scoperte dei Dadi” but in the 1718 edition of Galileo’s
collected works, the title was changed to “Consideratione sopra il Giuoco
dei Dadi”.


b) The paradox 


A fair dice, when thrown, has an equal chance of falling on any of the 
numbers 1, 2, 3, 4, 5 or 6. In the case of two dice the sum of the numbers 
thrown is between 2 and 12. Both 9 and 10 can be made up in two different
ways out of the numbers 1, 2, ..., 6: 9=3+6=4+5 and 10=4+6=5+5.
In the 3 dice problem, both 9 and 10 can be made up in six
ways. Why then is 9 more frequent if we throw two dice, and 10 if we
throw three?


c) The explanation of the paradox 


The problem is so simple to solve that it is really surprising that people 
at that time found it so shocking. Both Cardano and Galileo pointed out 
that the order of the cast must be taken into consideration. (Otherwise 
not all results would be equally probable.) In the case of 2 dice, 9 and
10 can be made up as follows: 9=3+6=6+3=4+5=5+4 and
10=4+6=6+4=5+5. This means that in the 2 dice problem we can
throw 9 in four ways but 10 only in three ways. Therefore 9 is the more
likely sum. (Since 2 dice can make 6·6=36 different number pairs of the
same probability, the chance of getting a 9 is 4/36 while that of
getting a 10 is only 3/36.) In the case of 3 dice it is just the other
way round: 9 can be obtained in only 25 ways but 10 in 27 ways (out of
6·6·6=216 equally probable triples). So 10 is more probable than 9.
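The counting argument above can be checked by brute force. The following is only an illustrative sketch (the helper name `ordered_ways` is an assumption of this example, not something from the text): it lists every ordered outcome of two or three fair dice and tallies the sums.

```python
from collections import Counter
from itertools import product


def ordered_ways(num_dice: int) -> Counter:
    """Count the equally likely ordered outcomes of `num_dice` dice by sum."""
    faces = range(1, 7)
    return Counter(sum(roll) for roll in product(faces, repeat=num_dice))


two_dice = ordered_ways(2)
three_dice = ordered_ways(3)

# Ordered ways (out of 36) to throw 9 and 10 with two dice.
print(two_dice[9], two_dice[10])
# Ordered ways (out of 216) to throw 9 and 10 with three dice.
print(three_dice[9], three_dice[10])
```

With two dice the counts are 4 and 3; with three dice they are 25 and 27, so the order of preference between 9 and 10 is reversed.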


d) Remarks 


(i) In spite of the simplicity of the dice problem, several great mathema- 
ticians failed to solve it because they forgot about the order of the cast. 
(This mistake is made quite frequently, even today.) Leibniz, one of the 
creators of the differential and integral calculus, and D'Alembert, one 
of the greatest authors of the famous French Encyclopedia, were both 
mistaken. D'Alembert was once asked the following question: What 
is the probability of a coin falling at least once heads if it is tossed twice? 
The scientist’s answer was 2/3, because he thought that there were only 
three possible outcomes (heads-heads, heads-tails, tails-tails) and among 
these only one is unfavourable, i.e., when we toss two tails. He neglected 
that the three possible outcomes are not equally probable. The correct 
answer is 3/4, because tossing heads-heads, heads-tails, tails-heads and 
tails-tails have the same chance and only the last one is unfavourable. 
D'Alembert's opinion was even published in the Encyclopedia in 1754
at the entry “Croix ou pile”.
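D'Alembert's mistake can be reproduced in a few lines. This is a sketch for illustration only (the variable names are assumptions of the example): it counts the favourable cases once over the four ordered outcomes and once over the three unordered ones.

```python
from fractions import Fraction
from itertools import product

# The four equally likely ordered outcomes of two tosses.
ordered = list(product("HT", repeat=2))
correct = Fraction(sum("H" in toss for toss in ordered), len(ordered))

# D'Alembert's list: order ignored, three outcomes wrongly treated as equally likely.
unordered = sorted({tuple(sorted(toss)) for toss in ordered})
dalembert = Fraction(sum("H" in toss for toss in unordered), len(unordered))

print(correct)    # probability of at least one head, counting ordered outcomes
print(dalembert)  # the value obtained when the order is neglected
```

The first value is 3/4 and the second is 2/3, the answer attributed to D'Alembert.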

(ii) The dice problem has some links with 19th and 20th century
microphysics. Suppose that we play with particles instead of dice. Each
face of the dice represents a phase cell on which the particles appear
randomly and which characterizes the state of the particles. Here dice are
equivalent to the Maxwell-Boltzmann model of particles. In this model
(used mostly for gas molecules) every particle has the same chance of
reaching any cell, so in a list of equally probable events, the order must
be taken into account, just as in the dice problem. There is another model
in which the particles are indistinguishable, and for this reason the order
must be left out of consideration when counting the equally possible
outcomes. This model is named after Bose and Einstein. Using this
terminology the point of our paradox is that dice are not of the
Bose-Einstein but of the Maxwell-Boltzmann type. It is worth mentioning that
neither of these models is correct for bound electrons because in this
case, only one particle may occupy any cell. In dice-language it means
that after having once thrown a 6 with one of the dice, we cannot get
another 6 on the other dice. This is the Fermi-Dirac model. Now the
question is which model is correct in a certain situation. (Besides these
three models, there are many others not mentioned here.) Generally we
cannot choose any of the models only on the basis of pure logic. In most
cases it is experience or observation that settles the question. But in the
case of dice, it is obvious that the Maxwell-Boltzmann model is the
correct one and at this moment that is all we need.
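To make the three counting rules concrete, here is a small sketch in "dice-language": two particles, six cells, and the "sum" of the two occupied cells. The dictionary `MODELS` and the function `sum_probability` are names invented for this illustration, and each model is simply taken to assign equal probability to every configuration it distinguishes.

```python
from fractions import Fraction
from itertools import combinations, combinations_with_replacement, product

CELLS = range(1, 7)  # six "phase cells", labelled like the faces of a dice

# How each model lists its equally probable configurations of two particles.
MODELS = {
    "Maxwell-Boltzmann": lambda: product(CELLS, repeat=2),             # order counts
    "Bose-Einstein": lambda: combinations_with_replacement(CELLS, 2),  # order ignored
    "Fermi-Dirac": lambda: combinations(CELLS, 2),                     # no shared cell
}


def sum_probability(model: str, target: int) -> Fraction:
    """Probability that the two occupied cells add up to `target`."""
    states = list(MODELS[model]())
    hits = sum(1 for state in states if sum(state) == target)
    return Fraction(hits, len(states))


for name in MODELS:
    print(name, sum_probability(name, 9), sum_probability(name, 10))
```

Only under Maxwell-Boltzmann counting do 9 and 10 get the probabilities 4/36 and 3/36 observed with real dice; under Bose-Einstein counting the two sums would be equally likely.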


e) References 


A classical monograph on the history of classical probability theory is:
Todhunter, I., History of the Mathematical Theory of Probability, which was first
published in 1865 and republished in 1949 by the Chelsea publishing house.

The description of the prehistory and earliest periods of probability theory can be 
found in: 

David, F. N., Games, Gods and Gambling. Griffin, London, 1962. 

The following books point out the historical and philosophical aspects of early prob- 
ability theory: 

Hacking, I., The Emergence of Probability. Cambridge University Press, 1973. 

Maistrov, L. E., The Development of the Notion of Probability (in Russian). Nauka, 
Moscow, 1980. 

Readers interested in the history of early probability theory may find further details 
in the periodicals Biometrika and Archive for the History of Exact Sciences. 
The English translation of the first monograph on the problems of dice as well as the
detailed biography of its author Gerolamo Cardano can be found in
Ore, Ø., Cardano, the Gambling Scholar. Princeton University Press, Princeton, 1953.


2. DE MÉRÉ'S PARADOX


a) The history of the paradox 


There is an old story (probably from Leibniz) that the well-known 17th
century French gambler Chevalier de Méré was on his way to his estate
in Poitou when he met Blaise Pascal, one of the most famous scientists
of the century. De Méré posed two problems to Pascal, both connected
with games of chance. The first problem was the paradox in question (the
second one is the next paradox). In 1654 Pascal corresponded with
Pierre de Fermat, another highly gifted scientist living in Toulouse, about
these two questions. They both came to the same result, which pleased
Pascal very much. He writes in a letter: “I see that the truth is the same
in Toulouse and in Paris.” Øystein Ore, professor at Yale University, has
pointed out that the paradoxes attributed to de Méré had, in fact, been
common knowledge much earlier; it was just that Pascal had not known
about them. Nor is it true that the chevalier was a passionate gambler.
He was interested in paradoxes theoretically rather than practically,
which is why he was not satisfied that Pascal had “only” solved the
problem, confirming the answer he already knew was right. He could
not see from the solution how the contradiction was resolved.


b) The paradox 


In four throws of a single dice the probability that we get at least one ace 
is more than 1/2, whereas in 24 throws of 2 dice the probability that we 
get a double ace (at least once) is less than 1/2. This seems surprising since 
the chance of getting one ace is six times as much as the chance of a double 
ace, and 24 is exactly 6 times as great as 4. 


c) The explanation of the paradox 


If one true dice is thrown k times, then the number of possible (and
equally likely) outcomes is $6^k$. In $5^k$ cases out of these $6^k$, the dice
does not turn up an ace, hence the probability of throwing at least one
ace in k throws is

$$\frac{6^k - 5^k}{6^k} = 1 - \left(\frac{5}{6}\right)^k,$$

and that is greater than 1/2 if k=4. On the other hand, the quantity
$1 - (35/36)^k$, which we can obtain in the same way, is still smaller than
1/2 for k=24 and only exceeds 1/2 for k=25. So the “critical value”
is 4 for a single dice and 25 for a pair of dice. This undoubtedly correct
solution did not in fact satisfy de Méré as he had known the answer
itself; it was just that he did not understand why it was incompatible
with the “proportionality rule of critical values”, which says that if the
probability decreases to one sixth, then the critical value increases
six times (4:6 = 24:36). Abraham de Moivre (1667-1754) proved in his
book “Doctrine of Chances”, published in 1718, that the “proportionality
rule of critical values” was not far from the truth. For if p is the
probability of an event (for example, the probability of throwing an ace
is p=1/6), then the critical value k can be calculated by solving the
equation

$$(1-p)^x = \frac{1}{2}$$

(the equation can be solved if p is strictly between 0 and 1). The critical
value k is the smallest integer which is greater than x. The solution of
the above equation is:

$$x = -\frac{\ln 2}{\ln(1-p)} = \frac{\ln 2}{p + p^2/2 + \ldots}, \qquad (*)$$

where ln denotes natural logarithm (base e=2.71...). It is apparent from
this solution that if $p^2$ is negligibly small, then p decreases approximately
in proportion to the increase in the critical value, just as de Méré thought.

De Moivre used the approximation formula

$$x \approx \frac{\ln 2}{p} \approx \frac{0.69}{p}$$

to examine the Royal Oaks Lottery (the London Lottery). In that case the value
of p was 1/32 and for p=1/32 formula (*) gives x=21.83..., while
the above approximation gives 22.08, which is near to
the correct value. De Méré's paradox occurred because for p=1/6,
$p^2/2$ (and other terms in the denominator of formula (*)) are not small
enough to neglect. Thus the “proportionality rule of critical values” is
just an approximate rule, the error of which increases as p increases.
This is the real solution of the paradox.
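The critical values can be recomputed directly. The sketch below is an illustration only (the function names are assumptions of this example); it evaluates the probability of at least one success in k trials and the solution x of $(1-p)^x = 1/2$, together with the rough estimate ln 2 / p.

```python
import math


def p_at_least_once(p: float, k: int) -> float:
    """Probability that an event of probability p occurs at least once in k trials."""
    return 1 - (1 - p) ** k


def critical_value(p: float) -> tuple[float, int]:
    """Return x solving (1 - p)**x = 1/2 and the smallest integer greater than x."""
    x = -math.log(2) / math.log(1 - p)
    return x, math.floor(x) + 1


for label, p in (("one ace", 1 / 6), ("double ace", 1 / 36)):
    x, k = critical_value(p)
    rough = math.log(2) / p  # estimate that neglects p**2/2 and higher terms
    print(label, round(x, 3), k, round(rough, 3))

print(p_at_least_once(1 / 6, 4))    # above 1/2
print(p_at_least_once(1 / 36, 24))  # still below 1/2
print(p_at_least_once(1 / 36, 25))  # above 1/2
```

The ratio of the two exact values of x is a little above 6, not exactly 6, which is why 24 throws are not quite enough.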


d) Remarks 


(i) A typically incorrect solution of de Méré's problem goes back to 
Cardano. He reasoned as follows: the probability that we get a double 
ace is 1/36 so we have to throw the dice exactly 18 times to get a double 
ace at least once with probability 1/2. According to this reasoning, in 
more than 36 throws the probability that we get a double ace is more than 
1, which is, of course, nonsense. 

(ii) There are some “random quantities” which obey the “proportionality
rule”. (We shall discuss these random quantities in Paradox 8.)
Some of these random quantities are very important in atomic physics, 
where the critical value is called half-life. This is inversely proportional 
to the decay constant, which corresponds to p. 

(iii) The number of throws of a true dice necessary to turn up the first 
ace is a quantity depending on chance, a random variable. Let us denote 
this random variable by v. The possible values of v are 1, 2, 3, 4, .... 
The probability that v=k (where k is a positive integer) is $(5/6)^{k-1} \cdot (1/6)$.
So the mean or expected value of v (defined as the weighted average of
its possible values, the weights being the corresponding probabilities) is:

$$1 \cdot \frac{1}{6} + 2 \cdot \frac{5}{6} \cdot \frac{1}{6} + 3 \cdot \left(\frac{5}{6}\right)^2 \cdot \frac{1}{6} + \ldots = 6.$$


Somewhat more generally, if p is the probability of the realization of an 
event A, and we repeat independent trials until A occurs, the expected 
value of the number of necessary trials is 1/p. Thus these kinds of expect- 
ed values obey the ‘‘proportionality rule": we need six times as many 
throws on average to get a double ace as to get one ace. 
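The value 1/p can be checked numerically by summing the series term by term. This is a sketch only: the function name `expected_trials` and the truncation after 5000 terms are choices made for this illustration.

```python
def expected_trials(p: float, terms: int = 5000) -> float:
    """Partial sum of k * (1 - p)**(k - 1) * p for k = 1, ..., terms."""
    return sum(k * (1 - p) ** (k - 1) * p for k in range(1, terms + 1))


one_ace = expected_trials(1 / 6)      # waiting time for an ace with one dice
double_ace = expected_trials(1 / 36)  # waiting time for a double ace with two dice
print(one_ace, double_ace, double_ace / one_ace)
```

Up to rounding, the two sums are 6 and 36, so the expected waiting times, unlike the critical values, stand exactly in the ratio 1:6.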


(iv) Up to now an intuitive notion of probabilistic independence has
been used. We shall return to this later.

(v) The explanation of de Méré's paradox did not become widely 
known. In 1693, nearly forty years after Pascal had solved the problem,
Samuel Pepys (President of the Royal Society from 1684) proposed 
almost the same problem to Newton. Newton also found the right 
answer, but he could not satisfy Pepys either. 

(vi) “Dreydel’’ or "draydl" is an ancient game similar to dice. (It 
also resembles the English game put-and-take.) Dreydel is played by 
Jews at the Chanukah festival. Quite recently Feinerman discovered 
(see reference below) that this game is unfair if the number of players 
is more than two, though, paradoxically, nobody had observed this 
fact for over 2000 years! 

The dreydel is a four-sided top whose sides are denoted by the letters 
N, G, H and S (corresponding to the Hebrew letters Nun, Gimel, Hay 
and Shin). The game is played with any number of players, each of whom 
contributes one unit to the pot to start the game. The players continue 
to take turns spinning the dreydel until some mutually agreed stopping 
point. The payoffs (to the spinning player) corresponding to each of the 
four equally likely outcomes are, N: no payoff, H: half the pot, G: 
entire pot, S: put one unit into the pot. When one player spins a G, he 
collects the entire pot, and all the players then contribute one unit to 
form the new pot. If the number of players is denoted by m then the
expected value of the payoff of the n-th spin is $E_n = m/4 + (5/8)^{n-1}(m-2)/8$.


[Figure 1. The faces of a regularly spotted dice (A) and two irregularly
spotted dice (B and C). Rolling the two true dice which are irregularly
marked by spots (B and C), the probability that the sum of the numbers we
score is 2, 3, ..., 12 is the same as in the case of two regularly
spotted dice.]
We thus see that if m>2 then $E_n$ is a strictly decreasing sequence.
Therefore the first player (whose spins correspond to n = 1, m+1, 2m+1, ...)
has a term-for-term greater expected payoff than the second player.
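The expected payoff of the n-th spin can be recomputed from the rules of the game. The sketch below rests on assumptions that are ours rather than the text's: the pot may be split into fractions of a unit, and the closed form is read as $E_n = m/4 + (5/8)^{n-1}(m-2)/8$. Because both the payoff and the next pot depend linearly on the current pot, it is enough to track the expected size of the pot.

```python
from fractions import Fraction


def expected_payoffs(m: int, spins: int) -> list[Fraction]:
    """Expected payoff of spins 1..spins with m players (fractional pots allowed)."""
    pot = Fraction(m)  # every player puts in one unit to start
    payoffs = []
    for _ in range(spins):
        # N: nothing, H: half the pot, G: whole pot, S: pay one unit (each 1/4).
        payoffs.append((pot / 2 + pot - 1) / 4)
        # Expected pot afterwards: N leaves it, H halves it,
        # G empties it and all m players refill it, S adds one unit.
        pot = (pot + pot / 2 + m + (pot + 1)) / 4
    return payoffs


def closed_form(m: int, n: int) -> Fraction:
    """Assumed reading of the formula for the n-th spin."""
    return Fraction(m, 4) + Fraction(5, 8) ** (n - 1) * Fraction(m - 2, 8)


values = expected_payoffs(m=4, spins=6)
print([str(v) for v in values])
print(all(v == closed_form(4, n) for n, v in enumerate(values, start=1)))
```

For m=2 every spin has the same expected payoff; for m>2 the sequence decreases, which is the source of the first player's advantage.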
3. THE DIVISION PARADOX


a) The history of the paradox 


This paradox was first published in Venice in 1494 in a summary of the
mathematics of the Middle Ages. The author, Fra Luca Paccioli (1445-1509),
entitled his book “Summa de arithmetica, geometria, proportioni
et proportionalità”. This book uses the word million and explains the
rules of double-entry bookkeeping for the first time. It is interesting to
note that Fra Luca and Leonardo da Vinci became close friends in Milan and,
due to this friendship, Leonardo illustrated Fra Luca's work “De Divina
Proportione”, which was published in Venice in 1509. Øystein Ore
recently found an Italian manuscript dating from 1380 which also mentions
the paradox of division. There are many indications that the problem
is of Arabic origin, or at least reached Italy through Arabic teaching.
However old this problem may be, it is a fact that it still took a very long
time for the question to be correctly solved. Paccioli himself did not even
realize its connection with probability theory, for he considered it simply
a problem in proportions. An incorrect solution was given by Niccolo
Tartaglia (1499-1557), though he was such a genius that he discovered
the formula to solve cubic equations in one night in a mathematical
duel. After several unsuccessful attempts, Pascal and Fermat
eventually gave the right answer to the problem independently of each
other in 1654. It was such an important discovery that this date is considered
by many people to be the birth of probability theory, and all the
previous results to belong only to its prehistory.


b) The paradox 


Two players are playing a fair game (i.e., both of them have the same 
chance of winning) and they have agreed that whoever wins 6 rounds 
first gets the whole prize. Let us suppose that the game actually stops 
before one of them wins the prize (e.g., the first player has won 5, the 
second 3 rounds). How could the prize be divided fairly? Though this 
problem is not, in fact, a paradox, the unsuccessful attempts of some of 
the greatest scientists to solve it, and the wrong, contradictory answers 
created the legend of a paradox. One of the answers was to divide the
prize in the ratio of the rounds won, i.e., 5:3. Tartaglia suggested a
division in the ratio 2:1. (Most probably he thought that the first player
had won two rounds more than the other, which is one third of the
necessary 6 rounds, so the first player should get one third of the prize
and the rest should be divided fifty-fifty.) As a matter of fact the fair
ratio is 7:1, which is far from the previous results.


c) The explanation of the paradox 


Both Pascal and Fermat considered it a problem of probabilities. So
the fair division is in the ratio of the first player's chance of winning
to that of the second. We shall calculate that in a case where the first
player needs only one round to win while the second player needs three, the
fair ratio is 7:1. Following Fermat's idea we shall continue the game with
3 fictitious rounds even if some of them seem to be superfluous (i.e.
when one of the players has already won the game). This extension makes
all the possible 2·2·2=8 outcomes equally probable. Since there is
only one outcome when the second player gets the prize (i.e., when he
wins all the 3 rounds) while in the other cases the first player wins, the
fair ratio is 7:1.
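Fermat's device of playing out fictitious rounds lends itself to direct enumeration. The following sketch is only an illustration (the function `first_player_share` and its parameters are names assumed for this example): it lists every sequence of fictitious rounds and records who reaches the required number of wins first.

```python
from fractions import Fraction
from itertools import product


def first_player_share(need_first: int, need_second: int) -> Fraction:
    """Fraction of the equally likely fictitious continuations won by player 1."""
    rounds = need_first + need_second - 1  # enough rounds to decide every game
    wins_for_first = 0
    for outcome in product((1, 2), repeat=rounds):
        left_1, left_2 = need_first, need_second
        for winner in outcome:
            if winner == 1:
                left_1 -= 1
            else:
                left_2 -= 1
            if left_1 == 0 or left_2 == 0:
                break  # the remaining rounds are the "superfluous" ones
        wins_for_first += left_1 == 0
    return Fraction(wins_for_first, 2 ** rounds)


share = first_player_share(need_first=1, need_second=3)
print(share, 1 - share)
```

For the score 5 to 3 in a game to 6, the first player needs one round and the second needs three, and the shares come out as 7/8 and 1/8.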


d) Remarks


(i) The general solution for the case when the first player needs n and
the second player needs m more rounds to win was also due to Pascal
and Fermat. The chance that the first player gets the prize is

$$\frac{1}{2^{n+m-1}} \sum_{j=n}^{n+m-1} \binom{n+m-1}{j}.$$

(Here the number of fictitious rounds is n+m−1 and all the possible
$2^{n+m-1}$ outcomes are equally probable.) In 1654 the whole of Paris was
talking about the discovery of a new science, i.e., probability theory.
Some months later a young genius, Christian Huygens arrived there from 
Holland to discuss with either Pascal or Fermat the problems of probabil- 
ity which he was also concerned with. As it happened he was unable to 
meet either of them. (Pascal was in ecstasy over religion and did not receive 
guests and Fermat lived far from Paris.) Nevertheless he heard of the 
most interesting results. He soon returned to Holland and began to write 
his book on probability theory. This excellent work, which also contains 
the solution of the problem of division for 3 players, was published in 
1657 under the title of “De Ratiociniis in Aleae Ludo" as a part (the 
fifth book) of Schooten's “Exercitationes Mathematicarum". Huygens’s 
work totals 16 pages and consists of a short preface and 14 propositions 
on gambling. 
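The general solution in remark (i), read as a sum of binomial coefficients over the n+m−1 fictitious rounds, can be evaluated exactly. This is a sketch for illustration; the function name `first_player_chance` is an assumption of the example.

```python
from fractions import Fraction
from math import comb


def first_player_chance(n: int, m: int) -> Fraction:
    """Chance that the player needing n rounds wins before the one needing m."""
    total = n + m - 1  # number of fictitious rounds
    # The first player takes the prize when he wins at least n of them.
    favourable = sum(comb(total, j) for j in range(n, total + 1))
    return Fraction(favourable, 2 ** total)


print(first_player_chance(1, 3))  # the 5:3 score in a game to 6
print(first_player_chance(2, 3))
```

For n=1 and m=3 this reproduces the chance 7/8, and for n=m it gives 1/2, as symmetry requires.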

(ii) Fermat's beautiful idea of extending the play was applied by 
Anderson (see the reference below) in 1977! He reached the following 
striking theorem: Whether “service” alternates or the winner of one game
serves next, the initial server will still have the same probabilities of win- 
ning N games before his opponent does. 
e) References


4. THE PARADOX OF INDEPENDENCE

a) The history of the paradox 


First of all the independence of two random events A and B is defined.
Let us denote their probabilities by P(A) and P(B) and let P(AB) be
the probability that both A and B occur. (The symbol P is widely used
to denote the probability of an event, since not only in English but in
many other languages the initial letter of the word “probability” is
P: probabilitas in Latin, probabilité in French, probabilidad in Spanish,
probabilità in Italian etc.) Let A be an arbitrary event and B an event
with a positive probability. The probability of A, given that B has
occurred, i.e., the conditional probability of A on the hypothesis B, will
be denoted by P(A|B) and defined by the ratio

$$P(A|B) = \frac{P(AB)}{P(B)}.$$

Two events A and B are said to be independent if equation

P(A|B) = P(A)

holds, that is, if the conditional probability equals the unconditional one.
If we write the above equation in the form

P(AB) = P(A) · P(B) (*)

we get a simple equation symmetric in A and B, where we do not even
have to assume that P(B) is positive. It is therefore preferable to start
from the following definition: two events A and B are independent if
equation (*) holds.

The mathematical definition of independence and what we generally
think about independence are in harmony. For example, throwing two
dice, the events “ace with first dice” and “ace with second dice” are clearly
independent in an everyday sense and also in a mathematical sense. The
harmony however did not seem to be perfect. It was S. N. Bernstein who
called attention to the following paradox.


b) The paradox 


When tossing two true coins, let A be the event “the first coin falls heads”,
B the event “the second coin falls heads” and C the event “one (and
only one) of the coins falls heads”. Then the events A, B and C are pair-
wise independent but any two of them uniquely determine the third one. 


c) The explanation of the paradox 


First of all it is obvious that A and B are independent since the result
of the first throw is independent of the second one. The events A and C
(and also B and C), however, do not seem to be independent at first
sight, but, since $P(AC) = P(A) \cdot P(C) = \frac{1}{4}$ and similarly
$P(BC) = P(B) \cdot P(C)$, they are really independent. It is also true that any two
of the events determine the third one because each event (A, B and C)
occurs exactly when one and only one of the other two events occurs.
This paradoxical phenomenon shows that pairwise independence does
not mean that events are independent as a whole. If we want to express
the latter, we have to assume more than pairwise independence. A set
of events is called mutually independent if for an arbitrary choice of
finitely many events $A_1, A_2, \ldots, A_n$ from the set, the multiplication rule

$$P(A_1 A_2 \cdots A_n) = P(A_1) P(A_2) \cdots P(A_n)$$

holds, i.e., if the joint probability of the events is equal to the product
of the individual (marginal) probabilities.
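
The claim can be checked by listing the four equally likely outcomes of two coin tosses. The following is a minimal sketch in Python; the helper names `prob` and `both` are illustrative choices, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coins, each of the 4 outcomes has probability 1/4.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = lambda w: w[0] == "H"                       # first coin falls heads
B = lambda w: w[1] == "H"                       # second coin falls heads
C = lambda w: (w[0] == "H") != (w[1] == "H")    # exactly one coin falls heads

def both(e1, e2):
    """Intersection of two events."""
    return lambda w: e1(w) and e2(w)

# Pairwise independence: P(XY) = P(X) P(Y) for each of the three pairs.
pairwise = all(
    prob(both(e1, e2)) == prob(e1) * prob(e2)
    for e1, e2 in [(A, B), (A, C), (B, C)]
)

# Mutual independence would also need P(ABC) = P(A) P(B) P(C).
triple = prob(lambda w: A(w) and B(w) and C(w))
product_of_three = prob(A) * prob(B) * prob(C)

print(pairwise)                    # True
print(triple, product_of_three)    # 0 1/8
```

The three events can never occur together, so P(ABC) = 0, while the product of the three probabilities is 1/8: the multiplication rule fails for the triple although it holds for every pair.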


d) Remarks


(i) If the events $A_1, A_2, \ldots, A_n$ are not necessarily independent then we
can only state that

$$-\left(1-\frac{1}{n}\right)^{n} \le P(A_1 A_2 \cdots A_n) - P(A_1) P(A_2) \cdots P(A_n) \le (n-1)\, n^{-n/(n-1)}.$$

(ii) Several simple paradoxes can be solved only with the help of the
notion of independence. Let us examine the following problem. A boy
is going to play three tennis matches against his mother and father, and
he has to win twice in succession. The possible orders of matches are
“father-mother-father” or “mother-father-mother”. The boy has to decide
which order is more favourable to him, knowing that his father
plays better than his mother. At first one might think that the second order
is preferable to the boy as he plays twice with his mother in this
version. Yes, but in this case the boy has to win the only match he plays
against his father, otherwise he will not win twice in succession. Is it
perhaps preferable to choose the first variation? If the boy wins against
his father with probability p and with probability q against his mother,
then p < q since his father plays better than his mother. Assume that the
results of the matches are independent. Choosing the
first variation the boy has to win either the first and second matches, the
probability of which is pq, or the second and third ones, the probability
of which is qp. Thus the probability that one of these two events will
occur is pq + qp − pqp (pqp has to be subtracted or else the probability
of the boy winning three times is taken into account twice). Similarly if
the boy chooses the second possible variation then the probability that
he wins twice in succession is qp + pq − qpq. Since p < q, it follows that
pq + qp − pqp > qp + pq − qpq, which means that it is preferable for the
boy to choose the “father-mother-father” version!
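
A small numerical illustration of this comparison, as a sketch only: the values p = 3/10 and q = 6/10 are hypothetical, and the function name `win_twice_in_a_row` is chosen here for illustration. The matches are assumed to be independent.

```python
from fractions import Fraction

def win_twice_in_a_row(first, middle, last):
    """P(winning matches 1 and 2, or matches 2 and 3) for independent matches
    with the given win probabilities (inclusion-exclusion on the two events)."""
    return first * middle + middle * last - first * middle * last

p = Fraction(3, 10)   # hypothetical chance of beating the father
q = Fraction(6, 10)   # hypothetical chance of beating the mother (p < q)

fmf = win_twice_in_a_row(p, q, p)   # father-mother-father
mfm = win_twice_in_a_row(q, p, q)   # mother-father-mother

print(fmf, mfm)     # 153/500 63/250
print(fmf > mfm)    # True
```

The middle match must be won in either order, so it pays to face the weaker opponent there, even though this means meeting the stronger opponent twice.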

(iii) We can also define the independence of random variables. Let
$X_1, X_2, \ldots$ be arbitrary random variables assuming real values. The
variables are called mutually independent (or independent, for short),
if the events

$$A_1 = \{X_1 < x_1\}, \quad A_2 = \{X_2 < x_2\}, \quad \ldots$$

are mutually independent for arbitrary real values of $x_1, x_2, \ldots$. The
function $F(x) = P(X < x)$ is the distribution function of the random
variable X, and the function $F(x, y, \ldots, w) = P(X < x, Y < y, \ldots, W < w)$
is called the joint distribution function of the random variables X, Y, ..., W.
Now we can define the mutual independence of any (finite or
infinite) set of random variables in the following way: a set of random
variables is called independent if for any finite subset S of this set, the
joint distribution function of the random variables in S is equal to the
product of their individual (marginal) distribution functions. If the
distribution function F(x) and the joint distribution function F(x, y, ..., w)
can be written in the following form

$$F(x) = \int_{-\infty}^{x} f(u)\, du$$

and

$$F(x, y, \ldots, w) = \int_{-\infty}^{x} \int_{-\infty}^{y} \cdots \int_{-\infty}^{w} f(u, v, \ldots, t)\, du\, dv \cdots dt,$$

then we call the functions f(x) and f(x, y, ..., w) density functions. If
these density functions exist, independence means that the joint density
function is equal to the product of the individual density functions.

(iv) If the density function f(x) of the random variable X exists
then its expectation is

$$E(X) = \int_{-\infty}^{\infty} x f(x)\, dx.$$

The expectation of $(X - E(X))^2$ is called the variance of X. Its positive
square root is the standard deviation, which is a measure of the dispersion
of X around its mean value. (There exist other measures of spread
but standard deviation is undoubtedly the most important one. The
first use of the terms “standard deviation” and “variance” is due to K.
Pearson (1895) and R. Fisher (1920), respectively.) If the density
function of X is f(x), then its variance is

$$D^2(X) = \int_{-\infty}^{\infty} (x - E(X))^2 f(x)\, dx.$$

If X and Y are independent, then $E(XY) = E(X)E(Y)$ and
$D^2(X+Y) = D^2(X) + D^2(Y)$ (provided that the variances of X and Y exist). The
equation $E(X+Y) = E(X) + E(Y)$ holds without assuming the independence
of X and Y.


5. THE PARADOX OF BRIDGE AND LOTTERY


a) The history of the paradox 


The history of games of chance can be traced back to ancient times. 
They became so widespread that certain states and religions considered 
it their duty to suppress them. Frederick II, emperor of the Holy Roman
Empire, banned dice in 1232. (At that time it was probably the only popular
game of chance.) Louis IX, King of France, decreed in 1255 that even
dice making was illegal. In the Jewish Talmud a gambler was considered 
a thief and the Church pursued hazarders as well. 

Among modern games of chance card games are undoubtedly the
most widespread. The word “card” comes from the Greek word
χάρτης (paper); however, card games per se date back to times preceding
the invention of paper. Though we do not know where card games come
from, they seem to have reached Europe through Venice via China,
Persia, Syria and Palestine in the 13th century, at the time of the Crusades.
The facts are as follows. According to a 17th century Chinese
Encyclopedia, card type games were already known in China about
1120 AD. Parts of a 13th century Arabic card can be seen in the Istanbul
Topkapi Serai Museum. A Florentine decree banned a card game called
"naibi" in 1376. According to a 1377 manuscript, kept in the British 
Museum, card games became popular about that time in Switzerland. 
The Bibliothèque Nationale in Paris has 17 Tarot cards which were made
for Charles VI in 1392. Johannes Gutenberg printed Tarot cards the
same year as his famous Bible. The modern deck is derived from Gutenberg's
Tarot cards. In the Tarot deck there were 78 cards; the 22 high-ranking
cards were known as “atouts” (i.e., “above all”; later, these
“atouts” were called “trumps”). Some decades later the French dropped
the 22 “atouts” and the 4 “Knights”; the remaining 52 cards became the
modern deck. Since then the number of popular card games has risen
to several hundred, while the number of card sharpers has also increased.
This fact is reflected in Caravaggio's famous painting “The Cardsharps”,
which was painted in 1593. In 1765 a Paris police lieutenant,
Gabriel de Sartine, introduced roulette in order to reduce the influence 
of the sharpers. It became the most glamorous casino game and the oldest 
still in operation. Since the 17th century, lottery type games organized 
by the state have also become more and more popular. The first public 
lottery awarding money prizes, the Lotto de Firenze, was established in 
Florence in 1530. Another variation came into being in 1620 when the 
council of Genova needed five more members to be complete. These 
members were chosen from among 90 citizens whose names were put 
in an urn and the five names were pulled out. The citizens of Genova 
were allowed to bet on the five lucky citizens. Even today card games, 
roulette, lottery and other games of chance are very popular. Sometimes
certain winning strategies appear claiming to be “absolutely reliable”
but in fact they have no scientific background. On the other hand, exact
scientific theories are only known by the very small society of mathematicians.
These theories generally support the empirical rules used in
practice. However, mathematical theorems may contradict common
sense and become the source of paradoxes. Here we only deal with
two of them.


b) The paradox


(i) The paradox of bridge 

Let us suppose that in a two-hand coalition of 26 cards there are 6
trumps altogether. Then the most probable distribution of trumps is the
following: 4 in one hand and 2 in the other. The exact probability of this
distribution is 78/161, which is a little less than 1/2, while the probability
of the 3-3 distribution is just a little more than 1/3, exactly 286/805. Now
suppose we have to throw trumps twice and in the coalition both hands
can do so. In this case in the two-hand coalition there remain only 2
trumps. Either one hand has both cards or each of them has one. If 2
trumps and 20 other cards are distributed between two hands then the
chance that one of them will get both trumps is 10/21, while the probability
of the other case is 11/21. So the latter is more probable, i.e., the more
probable distribution 1-1 comes from the less probable distribution 3-3.
Is this not a contradiction?


(ii) The paradox of lottery 


Most lottery players would not give a “too symmetrical" tip though every 
tip has the same chance. The reason is very simple. They know from 
experience that generally an irregular tip wins. In fact it is more advan- 
tageous to give a very symmetrical tip just because it is avoided by most 
of the other players. 


c) The explanation of the paradoxes 


(i) The chance that we get a 3-3 distribution of the trumps is

$$\frac{\binom{6}{3}\binom{20}{10}}{\binom{26}{13}} = \frac{286}{805}.$$

Similarly for the 2-4 and 4-2 distributions it makes

$$\frac{\binom{6}{2}\binom{20}{11}}{\binom{26}{13}} + \frac{\binom{6}{4}\binom{20}{9}}{\binom{26}{13}} = \frac{78}{161},$$

so the second distribution is more probable. If we have 2 trumps and 20
other cards the 1-1 trump distribution has a chance of

$$\frac{\binom{2}{1}\binom{20}{10}}{\binom{22}{11}} = \frac{11}{21}.$$

The probability of the complementary event is of course 10/21, as stated.

But then where is the mistake? First of all we will show where it cannot
be. It is natural to think that after trumping (and seeing that both hands
have thrown 2 trumps) the probabilities have changed due to the
information we have acquired meanwhile. It is true that the conditional
probabilities (the condition is that both hands had at least 2 trumps)
are different from those without condition, but both probabilities are
multiplied by the same number when we calculate the conditional
probabilities. Consequently, their ratio does not change and so the paradox
is not solved. (286/805 + 78/161 = 676/805, thus the conditional probabilities
are 805/676 times the ones without conditions.) The real source of
error is the following. If the original distribution was 3-3 trumps, then
each hand can throw two of its trumps in 3·2 = 6 different ways. That gives
6·6 = 36 possible events altogether. If the distribution was 4-2 or 2-4,
then they could throw their trumps only in 4·3·2·1 = 24 ways. As we
can see now the original distribution of trumps before trumping is very
important. If we take the original situation into consideration then we
get a ratio which is only 24/36 = 2/3 of the ratio calculated above. Indeed,
(286/805) : (78/161) = 11/15 is 2/3 of the ratio (11/21) : (10/21) = 11/10.
Now the paradox is solved entirely.

(ii) It is not at all surprising that symmetrical or regular tips very seldom
win. If the tip consists of 5 numbers then the possibilities with 90
numbers are about 44 million (exactly 43,949,268), while regular fives
make only a few thousand. In the case of a regular 5-number-tip which
is very seldom given by others, the prize would actually increase (though
the chance of winning remains the same). The player would certainly
see it after a while if he played with a lot of lottery tickets.
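
The fractions quoted in (i) and the count quoted in (ii) can be reproduced with exact arithmetic. This is a sketch only; the function name `split_probability` is introduced here for illustration and assumes the cards are dealt evenly and at random to the two hands.

```python
from fractions import Fraction
from math import comb

def split_probability(trumps, others, k):
    """P(a given hand holds exactly k trumps) when trumps + others cards
    are dealt evenly and at random to two hands."""
    total = trumps + others
    half = total // 2
    return Fraction(comb(trumps, k) * comb(others, half - k), comb(total, half))

# Six trumps among 26 cards.
p33 = split_probability(6, 20, 3)
p42 = split_probability(6, 20, 4) + split_probability(6, 20, 2)

# Two trumps among 22 cards (after both hands have played two trumps).
p11 = split_probability(2, 20, 1)
p20 = split_probability(2, 20, 2) + split_probability(2, 20, 0)

print(p33, p42)                     # 286/805 78/161
print(p11, p20)                     # 11/21 10/21
print((p33 / p42) / (p11 / p20))    # 2/3, the factor 24/36 from the explanation
print(comb(90, 5))                  # 43949268 possible lottery tips
```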


d) Remarks 


(i) Rarely given 5-number-tips can be traced back easily as it is always 
in the news how many players have got 2, 3, 4 or 5 hits and how much 
the prizes were. (If a more frequent 5-number-tip is drawn, prizes are 
less.) In case of football pools mathematical analysis is a bit more compli- 
cated because there are no fixed tips. Calculations may rely upon the 
tips of certain newspapers and the number of people taking their advice. 

(ii) Works on games of chance (from roulette which is basically hazard
to bridge where the influence of randomness is reduced to a minimum) must
fill several libraries. In the 20th century the general theory of games was
also developed mainly due to the work of John von Neumann. We will 
come to it later. 

(iii) The following paradox appeared in 1693(!) in the Philosophical
Transactions of the Royal Society (17, 677-681): “An Arithmetic Paradox,
concerning the Chances of Lotteries” by the Honourable Francis
Roberts, Esq; Fellow of the R. S.


"As some Truth (like the Axioms of Geometry and Metaphysics)
are self-evident at the first View, so there are others no less certain in 
their Foundations, that have a very different Aspect, and without a 
strict and careful Examination rather seem repugnant. We may find 
Instances of this kind in most Sciences. ... I shall add one Instance in 
Arithmetic, which perhaps may seem as great a Paradox as any of the 
former. 


There are two Lotteries, at either of which a Gamester paying a Shilling
for a Lot or Throw; the First Lottery Upon a just Computation of the
Odds has 3 to 1 of the Gamester, the Second Lottery but 2 to 1; nevertheless
the Gamester has the very same disadvantage (and no more) in
playing at the First Lottery as the Second."

The example following this problem makes the following point (we use modern
terminology). Let X denote our prize, depending on chance,
in a game. (If we lose, X is negative.) Let $X^+ = X$ if X is positive
and otherwise 0, and $X^- = -X$ if X is negative and otherwise 0. Let $Y$, $Y^+$, $Y^-$
denote the same random variables in another game. Though the expected
values of X and Y are the same, according to the example the ratio of the
expected values of $X^+$ and $X^-$ may differ from that of $Y^+$ and $Y^-$.
This means that from $E(X) = E(Y)$ it does not follow that
$E(X^+)/E(X^-) = E(Y^+)/E(Y^-)$. Hardly anybody would wonder about this result.
(Obviously, if the expected value of both X and Y is 0 then the
above ratios are equal, for both take the value 1.) If the problem is
nevertheless considered a paradox, it should rather be called the paradox
of expected value than “an arithmetic paradox”.
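
To make the point concrete, here is a sketch with two made-up games (they are not the lotteries of Roberts's article, whose details are not reproduced in the text). Each game is a list of (payoff, probability) pairs; the function name `summary` and the convention that the loss part is counted as a positive number are assumptions of this sketch.

```python
from fractions import Fraction

def summary(game):
    """For a game given as (payoff, probability) pairs, return
    the expected payoff, the expected gain part and the expected loss part."""
    mean = sum(x * p for x, p in game)
    gain = sum(x * p for x, p in game if x > 0)    # expectation of the positive part
    loss = sum(-x * p for x, p in game if x < 0)   # expectation of the negative part, as a positive number
    return mean, gain, loss

# Hypothetical game X: win 1 with probability 1/4, lose 1 with probability 3/4.
game_x = [(1, Fraction(1, 4)), (-1, Fraction(3, 4))]
# Hypothetical game Y: win 2 with probability 1/6, lose 1 with probability 5/6.
game_y = [(2, Fraction(1, 6)), (-1, Fraction(5, 6))]

mean_x, gain_x, loss_x = summary(game_x)
mean_y, gain_y, loss_y = summary(game_y)

print(mean_x, mean_y)                     # -1/2 -1/2
print(gain_x / loss_x, gain_y / loss_y)   # 1/3 2/5
```

Both games cost the player half a unit per play on average, yet the ratio of expected gain to expected loss is 1/3 in one and 2/5 in the other.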


6. THE PARADOX OF GIVING PRESENTS; HORSE KICKINGS;
TELEPHONE CALLS; MISPRINTS


a) The history of the paradox 


Classical probability theory dealt mainly with combinatorial questions 
(connected with games of chance). In these problems random events 
usually had a finite number of possible outcomes and all outcomes had 
the same probability. In this simple case the probability of an event 
(A) is the ratio of the number of “favourable” cases to the “total num- 
ber of cases". The first detailed monograph on probability theory also 
dealt with such probabilities. This was a book by Rémond de Montmort
published in Paris in 1708. The “Paradox of Giving Presents” is a variant
of a problem discussed in Montmort's book in the language of card- 
games. 


b) The paradox 


The members of a company decide to give each other presents in the 
following way. Everybody brings a present, which is put with the others, 
mixed and distributed at random to the people. This is a fair way of 
distributing presents and is usually applied in the belief that the proba- 
bility of a match, i.e., somebody getting his own present, is very small 
if the company is large. Paradoxically, the probability of at least one 
match is much larger than the probability of no matches (except if the 
company consists of exactly two members, when the chance of no matches 
is 50%). 


c) The explanation of the paradox 


Consider a company of n people; then the number of presents is also n.
The presents can be distributed in n! different ways. (This is the total
number of cases.) The number of cases when nobody gets his own present is

$$\binom{n}{0} n! - \binom{n}{1}(n-1)! + \binom{n}{2}(n-2)! - \binom{n}{3}(n-3)! + \cdots + (-1)^n \binom{n}{n}\, 0!,$$

thus the ratio of favourable cases to the total number of cases is

$$p_n = 1 - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \cdots + \frac{(-1)^n}{n!},$$

and $p_n$ is really smaller than 1/2 if n > 2.

At a gathering of at least 6 people, for example (n ≥ 6), $p_n \approx 1/e \approx 0.368$,
accurate to three decimal places. The probability of a certain
match, i.e., that a certain person gets his own present, is clearly 1/n,
and 1/n converges to 0 as n increases. This paradox shows that “many
a little makes a mickle”: in spite of the small probabilities (1/n) of certain
matches, the probability of having at least one match is roughly 2/3.


d) Remarks


(i) The probability $p_n$ converges to $e^{-1}$ as n increases. If n is at least 6
then $p_n = e^{-1}$ accurate to three decimal places. More generally the
probability of having exactly k matches is approximately $e^{-1}/k!$ (in the above sense).
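
For small n the distribution of the number of matches can be obtained by listing all n! ways of handing out the presents. The sketch below does this for n = 6 and compares the result with the alternating sum for $p_n$ and with $e^{-1}/k!$; the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import exp, factorial

def p_no_match(n):
    """p_n = 1 - 1/1! + 1/2! - ... + (-1)^n / n!"""
    return sum(Fraction((-1) ** j, factorial(j)) for j in range(n + 1))

def match_distribution(n):
    """Exact distribution of the number of matches, by brute force (small n only)."""
    counts = [0] * (n + 1)
    for perm in permutations(range(n)):
        # Person i receives present perm[i]; a match means perm[i] == i.
        counts[sum(i == x for i, x in enumerate(perm))] += 1
    return [Fraction(c, factorial(n)) for c in counts]

dist = match_distribution(6)
print(dist[0] == p_no_match(6))         # True: enumeration agrees with the formula
print(float(p_no_match(6)), exp(-1))    # compare p_6 with 1/e
for k in range(4):
    # exact probability of k matches for n = 6, next to the limit e^{-1}/k!
    print(k, float(dist[k]), exp(-1) / factorial(k))
```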

(ii) We shall examine another problem connected with the paradox
of giving presents. Consider again a company of n people and n presents.
Now the presents are distributed such that every person may get every
present with the same probability independently of the distribution of
other presents. Thus it may happen that somebody gets more than one
present and others do not get any presents at all. Presents can be distributed
now in $n^n$ different ways ($n^n$ is the total number of cases). Let A be
the event that a certain person does not get any present. Then all the n
presents are distributed among the remaining (n − 1) people and this
can be done in $(n-1)^n$ different ways. Therefore the probability of event
A is

$$q_n = \frac{(n-1)^n}{n^n} = \left(1 - \frac{1}{n}\right)^n.$$

The sequence $q_n$ also tends to $e^{-1}$ just like $p_n$ did. Generalizing our result:
the probability that a certain person gets exactly k presents converges to
$e^{-1}/k!$ as $n \to \infty$. Consider now the case where the number of people (n)
is not necessarily equal to the number of presents (m). In this case the
probability we seek is $q_n = \left(1 - \frac{1}{n}\right)^m$. If the ratio $m/n$ tends to a parameter
$\lambda$ (i.e., if the average number of presents per person is $\lambda$ or tends
to $\lambda$) then $q_n$ converges to $e^{-\lambda}$ (where $\lambda$ can be an arbitrary positive
real number). Finally the probability that a certain person gets exactly k
presents converges to

$$r_k = \frac{\lambda^k}{k!} e^{-\lambda}.$$

We say that a random variable taking only non-negative integer values
has the Poisson distribution if it assumes the value k with probability $r_k$.

Figure 2. Poisson distribution with parameters λ = 2 and λ = 5.

As we have seen above the random number of presents that a certain
person gets approximately follows the Poisson distribution with parameter
λ, if the average number (expected value) of presents per person
is λ. Returning to the “Paradox of Giving Presents”, the number of
people who get their own presents also follows, approximately, the Poisson distribution
with parameter λ = 1, and this is quite natural since on the average there
is only one person who gets his own present. (The probability that a
certain person gets his own present is 1/n and for n people it adds to
unity whatever the value of n.)
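
The convergence described in (ii) can be observed numerically. In the sketch below, m presents are handed out independently and uniformly among n people, so the number received by one fixed person is binomial with m trials and success probability 1/n. The value λ = 2 and the function names are illustrative choices.

```python
from math import comb, exp, factorial

def presents_pmf(k, n, m):
    """P(a given person gets exactly k of m presents) when each present goes
    to one of n people uniformly and independently of the others."""
    return comb(m, k) * (1 / n) ** k * (1 - 1 / n) ** (m - k)

def poisson_pmf(k, lam):
    """r_k = lam^k e^{-lam} / k!"""
    return lam ** k * exp(-lam) / factorial(k)

lam = 2.0   # average number of presents per person (illustrative)
for n in (10, 100, 10_000):
    m = int(lam * n)
    print(n, [round(presents_pmf(k, n, m), 4) for k in range(4)])
print("limit", [round(poisson_pmf(k, lam), 4) for k in range(4)])
```

As n grows with m/n fixed, the printed rows approach the Poisson probabilities in the last line.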

(iii) The notion of Poisson distribution appeared first in a book by the
French scientist Siméon Denis Poisson (1781-1840). (Section 81 of
“Recherches sur la Probabilité des Jugements en Matière Criminelle et
en Matière Civile, Précédées des Règles Générales du Calcul des Probabilités”,
published in 1837, deals with the application of probability theory
in trials.) Poisson discussed the following problem. Consider an experiment
in which the same phenomenon is repeatedly observed. It is assumed
that the trials in this experiment are independent, there are only two
possible outcomes for each trial and their probabilities remain the same
throughout the trials. (An experiment of this type is called a sequence
of Bernoulli trials.) It is usual to refer to the outcome with probability p
as a “success” and the other as “failure”. An example of Bernoulli trials
is provided by successive tosses of an unbalanced coin. Let $b_k$ be the
probability that n (Bernoulli) trials result in k successes (e.g., the probability
of exactly k heads in n tosses of an unfair coin). Then

$$b_k = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n.$$

Thus the number of successes $S_n$ in n Bernoulli trials is a random variable,
which takes the value k with probability $b_k$. A random variable
with possible values 0, 1, 2, ..., n is said to have binomial distribution if
it assumes the value k with probability $b_k$. The attribute “binomial”
refers to the fact that $b_k$ is just the kth term of the binomial expansion
of $(p + (1-p))^n$, since by the binomial formula
$(p + (1-p))^n = b_0 + b_1 + b_2 + \cdots + b_n$. Poisson discovered that if p is made smaller and smaller
at the same time that n is made larger and larger, so that the product
np = λ is fixed, then $b_k$ tends to $r_k$. Thus the Poisson distribution is an
approximation to the binomial distribution. The wide applicability and
great importance of the Poisson distribution was not realized in the
middle of the 19th century; moreover it almost completely fell into
oblivion. After 1894, however, it was applied to a very strange phenomenon.
Statistics were made of how many soldiers had been killed by horse
kicks during the 20 years between 1875 and 1894 in 14 different corps
of the German Army. According to the 280 data, 196 soldiers had died
this way, that is, λ = 0.7 on average. If the number of fatal horse-kicks
followed the Poisson distribution with the parameter λ = 0.7, then we
would expect no death in 139 cases, 1 death in 97 cases, 2 deaths in 34
cases, etc. out of the 280 cases. And what did the statistics show? The
actual data were 140, 91, 32 etc., respectively; practice and theory are in
such close agreement that we would have hardly expected more.
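
The expected counts quoted above follow directly from the Poisson probabilities with λ = 196/280. A minimal sketch (the variable names are illustrative; the observed figures are the ones given in the text):

```python
from math import exp, factorial

cases = 280                # 14 corps observed over 20 years
lam = 196 / cases          # average number of deaths per corps and year

# Expected number of corps-years with k deaths under a Poisson model.
expected = [cases * lam ** k * exp(-lam) / factorial(k) for k in range(3)]
observed = [140, 91, 32]   # figures quoted in the text for k = 0, 1, 2

print([round(e) for e in expected])   # [139, 97, 34]
print(observed)
```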

This comparison appeared in 1898 in the famous monograph by 
L. Bortkiewicz. The title of his book “Law of Small Numbers" refers 
to the fact that in the Poisson approximation p tends to 0 as n increases. 
(The title is quite misleading since it suggests that the Poisson approxima- 
tion is in contrast in some way with the laws of large numbers, which will 
be discussed later on). The Poisson distribution only began to be applied 
widely in the 20th century. For example, the number of certain kinds of 
goods sold on a given day approximately follows the Poisson distribu- 
tion, or the number of hemoglobins visible under the microscope, the 
number of strikes and wars in a year, the number of misprints in a text, 
or the number of telephone connections to a certain number on a certain 
day also follow approximately the Poisson distribution. If the average 
number of hemoglobins or misprints or telephone connections is λ then
they follow the Poisson distribution with parameter λ. If N telephone
lines are available in a telephone exchange then the number of busy lines
has approximately the Poisson distribution. Problems of this type were
studied by the Danish mathematician A. K. Erlang (1878-1929). He
pointed out, in 1906, that a better approximation can be obtained by
using the following truncated Poisson distribution:

$$e_k = \frac{\lambda^k / k!}{\sum_{j=0}^{N} \lambda^j / j!}, \qquad \text{where } k = 0, 1, \ldots, N.$$

Since that time this has been called the Erlang distribution.
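
A truncated Poisson distribution of this kind is obtained by keeping the Poisson weights $\lambda^k/k!$ for k = 0, ..., N and rescaling them so that they sum to 1. The sketch below assumes exactly that normalization; the parameter values λ = 3 and N = 5 are hypothetical.

```python
from math import factorial

def truncated_poisson(lam, N):
    """Probabilities proportional to lam^k / k! for k = 0..N, rescaled to sum to 1."""
    weights = [lam ** k / factorial(k) for k in range(N + 1)]
    total = sum(weights)
    return [w / total for w in weights]

e = truncated_poisson(lam=3.0, N=5)   # hypothetical load and number of lines
print([round(x, 4) for x in e])
print(sum(e))                         # 1, up to rounding
```

The last entry, the probability that all N lines are busy, is the quantity of practical interest in a telephone exchange.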


7. ST. PETERSBURG PARADOX


a) The history of the paradox


Probability theory, which originally described experiences connected
with games of chance, developed into a theory of great universality and 
gained ground in many fields of life. Thus it was not surprising that 
almost every notable scientific journal followed the example of the 
English “Philosophical Transactions” and published articles on proba- 
bility theory regularly. More and more scientists thought that probability 
was none other than the very guide of life, reason in terms of figures. 
However, in 1738 the Academy of St. Petersburg published an
article in which the mathematical calculation did not seem to be in har- 
mony with reason. Daniel Bernoulli wrote the article and made the 
Petersburg paradox known, but it was his cousin Nicolaus Bernoulli 
who had first raised the problem and mentioned the paradox in a letter 
written to Montmort in September 1713. (The Bernoullis were a renowned 
family of mathematicians, several members of which dealt with proba- 
bility theory, especially James Bernoulli, who will be mentioned later in 
connection with the laws of large numbers.) 


b) The paradox 


A single trial in the Petersburg game consists of tossing a fair coin until
it falls heads; if this occurs at the rth throw the player receives $2^r$ dollars
from the bank. Thus the gain redoubles at each toss. The question is the
following: how much money should the player pay as an entrance fee so
that the game will become fair? The Petersburg game is considered fair
in the classical sense if the mean (or expected) value of the net profit is 0,
but surprisingly we cannot fulfill this natural requirement no matter how
much (finite) money we pay.


c) The explanation of the paradox 


The loss of the bank has infinite expectation since the probability that
the game ends at the kth toss is $1/2^k$ and in this case the player receives
$2^k$ dollars. Then the bank has to pay

$$\frac{1}{2}\cdot 2 + \frac{1}{4}\cdot 4 + \frac{1}{8}\cdot 8 + \cdots = 1 + 1 + 1 + \cdots$$

dollars on average, which is an infinite quantity of money, so an infinite
amount of money would be a fair entrance fee. Though this calculation
is mathematically correct, the result was unacceptable; therefore several
mathematicians suggested more acceptable modifications.

(i) Buffon, Cramer and others suggested accepting the natural assumption
of limited resources (i.e., only a limited amount of money available
for the bank). Let this amount of money be one million dollars. Then the
expected value of the player's gain is

$$\frac{1}{2}\cdot 2 + \frac{1}{4}\cdot 4 + \cdots + \frac{1}{2^{19}}\cdot 2^{19} + \left(\frac{1}{2^{20}} + \frac{1}{2^{21}} + \cdots\right) 10^6 = 19 + 1.90\ldots \approx 21$$

(taking into consideration that $2^{20} > 10^6$). Therefore if the player pays a
21 dollar entrance fee then the game becomes somewhat favourable to
the bank.
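
The capped expectation is easy to compute exactly. The sketch below assumes the bank pays min(2^k, cap) when the first head appears at toss k; the function name `capped_expectation` is illustrative.

```python
from fractions import Fraction

def capped_expectation(cap):
    """Expected payout when the bank pays min(2**k, cap) dollars
    if the first head occurs at toss k (probability 1 / 2**k)."""
    total = Fraction(0)
    k = 1
    while 2 ** k < cap:
        total += Fraction(2 ** k, 2 ** k)   # each uncapped term contributes exactly 1
        k += 1
    # From toss k onward the payout is the cap; those probabilities sum to 1 / 2**(k - 1).
    total += Fraction(cap, 2 ** (k - 1))
    return total

value = capped_expectation(10 ** 6)
print(float(value))   # about 20.907, i.e. roughly 21 dollars
```

Raising the cap a thousandfold, to a billion dollars, adds only about ten dollars to the fair entrance fee, since the expectation grows like the base-2 logarithm of the cap.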

(ii) W. Feller pointed out that it is possible to determine entrance fees
which would make the Petersburg game fair. Denoting by n the number
of games the player played, the game can be considered fair if the ratio
of the accumulated gain $N_n$ to the accumulated entrance fees $R_n$ converges
to 1 as n tends to infinity, more precisely if for every $\varepsilon > 0$

$$P\left(\left|\frac{N_n}{R_n} - 1\right| > \varepsilon\right) \to 0 \quad \text{as } n \to \infty. \qquad (*)$$

Feller proved that the Petersburg game becomes fair if we put
$R_n = n \log_2 n$. By the paradox the game cannot be fair for $R_n = cn$, where
c is any finite constant. If, however, entrance fees may depend on the
number of games the player played then (according to Feller's theorem)
the Petersburg paradox vanishes.


d) Remarks 


(i) The relation (*) expresses a stability property of $N_n$. Similar stabilities
will appear with $R_n = cn$ in “The paradox of Bernoulli's law of large
numbers”.

(ii) On the results of 2048 games, Buffon found that the game becomes
fair with about 10 dollars entrance fee.

(iii) The following paradox is a companion to the St. Petersburg paradox.
(I heard it from Sam Gutmann after my talk in Dudley's seminar at
MIT in 1983.) Say you are given an opportunity to win $(-2)^n$ dollars
with probability $2^{-n}$, n = 1, 2, 3, ... Are you happy or sad? The answer
is you are both happy and sad. You are happy because the given lottery
is equivalent to a compound lottery in which you receive one of a list of
lotteries, each of which is favorable (has positive expectation). That is,
you receive with probability $(2^{-1} + 2^{-2} + 2^{-4})$ the lottery which awards
you $(-2)^j$ dollars with probability

$$\frac{2^{-j}}{2^{-1} + 2^{-2} + 2^{-4}} \qquad (j = 1, 2, 4),$$

or, with probability $(2^{-3} + 2^{-6} + 2^{-8})$, you receive the lottery which awards
you $(-2)^k$ dollars with probability

$$\frac{2^{-k}}{2^{-3} + 2^{-6} + 2^{-8}} \qquad (k = 3, 6, 8),$$

etc. Each of these individual three-award lotteries has positive expectation.
So you are happy. Of course you can also rewrite the original lottery
into three-award lotteries each of which has negative expectation. The
first has rewards $(-2)^1$, $(-2)^2$, $(-2)^3$, the second has rewards $(-2)^4$,
$(-2)^5$, $(-2)^7$, etc. So you are sad, too. [Here is a restatement for those
who are familiar with the notion of conditional expectation. Imagine
that the conditional expectation E(X|Y) is defined, not as usual, but
rather as $\int x\, P(dx \mid Y)$ where $P(dx \mid Y)$ is defined as usual. Then there
exist random variables X, Y, and Z such that $E(X \mid Y) > 0 > E(X \mid Z)$
with probability 1! Simply let X be the ultimate reward in the lottery,
i.e., $X = (-2)^n$ with probability $2^{-n}$. Let Y = 1 if we receive the first
positive lottery (i.e., $X = (-2)^1$, $(-2)^2$ or $(-2)^4$), let Y = 2 if we receive
the second (i.e., $X = (-2)^3$, $(-2)^6$ or $(-2)^8$), etc. Let Z = 1 if we
receive the first negative lottery (i.e., $X = (-2)^1$, $(-2)^2$, $(-2)^3$), Z = 2
if we receive the second (i.e., $X = (-2)^4$, $(-2)^5$, $(-2)^7$), etc.]
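
The signs of the expectations of the three-award lotteries listed above can be checked with exact fractions. This is a sketch covering only the four groups named in the text; the function name `conditional_mean` is illustrative.

```python
from fractions import Fraction

def conditional_mean(indices):
    """Expectation of the sub-lottery that pays (-2)**n with probability
    proportional to 2**(-n), for n in the given index group."""
    weight = sum(Fraction(1, 2 ** n) for n in indices)
    return sum(Fraction((-2) ** n, 2 ** n) for n in indices) / weight

# The "happy" grouping: each sub-lottery has positive expectation.
for group in [(1, 2, 4), (3, 6, 8)]:
    print(group, conditional_mean(group))   # 16/13, then 256/37

# The "sad" grouping: each sub-lottery has negative expectation.
for group in [(1, 2, 3), (4, 5, 7)]:
    print(group, conditional_mean(group))   # -8/7, then -128/13
```

Each term $(-2)^n \cdot 2^{-n}$ equals +1 or −1, so a group with two even indices and one odd index is favourable and a group with two odd indices and one even index is not. The series of these terms does not converge absolutely, which is why the regrouping can change the sign.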


8. THE PARADOX OF HUMAN MORTALITY.
THE AGELESS WORLD OF ATOMS AND WORDS


a) The history of the paradox 


The mathematical research on human mortality and life span began in 
the early days of capitalism due to the demands of the insurance compa- 
nies. Following the results obtained by John Graunt (1662), van Hudden 
and John de Witt (1671) in the 1660s, Edmond Halley (the discoverer of 
the comet named after him) published a paper in 1693 on mortality tables 
establishing the mathematical theory of life insurance. The following 
paradox (raised by d'Alembert) shows one of the “teething troubles”
of the new theory.


b) The paradox


In Halley's table the average life span is 26 years, and yet one still has 
an equal chance of dying before the age of 8 or living beyond the age of 8. 


c) The explanation of the paradox 


It is true that according to Halley's table one has an equal chance of 
surviving more than 8 years and dying before 8, but once he has already 
lived for 8 years he can still live for several decades. Therefore it is not 
surprising that the average life span is much more than 8. Supposing 
that out of a thousand people only one attains the life span of Methuse- 
lah, the average age will increase a lot but their probable age (which 
they survive at a chance of 50%) will not change significantly. 


d) Remarks 


(i) Let F(x) denote the probability that in a population the life span of
a randomly chosen person is less than x time units. (F(x) is the distribution
function of life span.) Suppose that F(x) has a density function
f(x). The average life span is $M = \int_0^{\infty} x f(x)\, dx$. On the other hand the
probable life span m is defined by the equation $F(m) = \frac{1}{2}$. In other words,
during the time period m half of the population dies out. It is clear from
these formulas that generally M and m are of quite different values.
While M is the expected value of life span, m is called its median.
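
A toy data set shows how far apart the mean and the median can be. The ten life spans below are invented for illustration (they are not taken from Halley's table); they were chosen so that the median is 8 and the mean is 26.

```python
from statistics import mean, median

# Invented life spans in years: half of the group dies young, the rest live on for decades.
ages = [1, 2, 3, 5, 7, 9, 40, 55, 60, 78]

print(mean(ages), median(ages))   # mean 26, median 8

# One extremely long-lived member moves the mean a lot and the median hardly at all.
with_methuselah = ages + [969]
print(mean(with_methuselah), median(with_methuselah))
```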

(ii) The notion of human mortality can easily be extended. If we
consider the amortization of industrial products or the decay of atoms
as death, then we obtain a widely applicable mathematical theory
developed from the study of human mortality. However, in this extended
field rather paradoxical phenomena may arise as well. While human
beings are neither immortal nor ageless, we can find ageless beings both
in nature and society. Let us define the notion of agelessness. Consider a
being ageless if the chance that it will survive a certain fixed time interval
is independent of the time it has already “lived”. Naturally, man does not
possess this feature, for the longer he lives the more probable it is that
he will die in a given period of time. It is interesting that not every
being imitates us. For example, radioactive atoms are ageless beings. If
the average life span of an ageless being is T, then the probability that
it will not die in the following time period x is $e^{-x/T}$, where x is a
positive number. The ageless property of radioactive particles follows
from the fact that their speed of decay is proportional to the number of
undecayed particles. The factor of proportionality is called the decay
constant and is denoted by $\lambda$. If there were $N_0$ undecayed
particles at the moment t = 0, then (as the decay constant does not change
with time, we get by integration that) at the moment x the number of
undecayed particles is $N_x = N_0 e^{-\lambda x}$. This means that the
survival probability for the moment x is $e^{-\lambda x}$. Consequently,
radioactive particles are really ageless and their average life span is
$T = 1/\lambda$. In other words, the life span of radioactive particles
follows an exponential distribution with parameter $\lambda$, i.e., its
density function is $\lambda e^{-\lambda x}$. The half-life of ageless
beings (the period during which half of the beings die out) is the root of
the following equation:

$$e^{-\lambda x} = \frac{1}{2}, \quad \text{namely} \quad x = \frac{\ln 2}{\lambda} = T \ln 2.$$
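
The ageless (memoryless) property and the half-life can be checked numerically. The following is a minimal sketch, assuming an average life span T = 10 time units chosen purely for illustration; the function names are hypothetical.

```python
import math

def survival(x, T):
    """Probability that an ageless being with average life span T survives a further time x."""
    return math.exp(-x / T)

def half_life(T):
    """Root of exp(-x / T) = 1/2."""
    return T * math.log(2)

T = 10.0          # assumed average life span, for illustration only
age, x = 7.0, 3.0

# Conditional probability of surviving x more time units, given survival up to `age`.
conditional = survival(age + x, T) / survival(age, T)
print(conditional, survival(x, T))      # the two values agree: no ageing
print(half_life(T), survival(half_life(T), T))
```

The conditional survival probability does not depend on the age already reached, which is exactly the definition of agelessness given above.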


(iii) The half-life of ageless beings has become a fundamental idea
in several fields of science. The radiocarbon method, worked out by the
American chemist Willard Frank Libby, is still the most widely used dating
method in archeological chronology. (Libby was awarded the Nobel Prize
for this discovery in 1960.) In 1950, following Libby's ideas, M. Swadesh
applied the method to linguistics, assuming that not only radioactive
atoms but also linguistic atoms, i.e., words, can be considered ageless.
The ancient basic vocabulary of a language is supposed to die out with a
half-life of 2000 years. With the help of this idea we can determine
the date when two related languages (e.g., Latin and Sanskrit) separated:
we only have to know the proportion of the basic vocabulary still existing
in both languages. A. Raun and E. Kangasmaa-Minn compared Hungarian and
Finnish. They found that the identical elements make up 21% and 27%,
respectively. (The calculations were made by other methods.) On the basis
of this, Hungarian and Finnish are thought to have separated some 4000 to
5000 years ago. Swadesh's 30-year-old method is very often used and is
known as lexicostatistics or glottochronology. (Swadesh's original article
was published in the International Journal of American Linguistics.)

(iv) Suppose the decay constant is $\lambda$. Then the probability that
exactly k particles will decay in a time period t is

$$\frac{(\lambda t)^k e^{-\lambda t}}{k!}.$$

This means that the number of decays is a Poisson random variable,
already known from the paradox of giving presents. The expected value
of this distribution is $\lambda t$, which is quite natural.
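
As a quick numerical check of these two statements, the sketch below sums the probabilities and computes the expected value directly. The values of the decay constant and of the time period are assumptions chosen only for illustration, and the function name is hypothetical.

```python
import math

def poisson_pmf(k, rate, t):
    """Probability of exactly k decays in a period t for decay constant `rate`."""
    mu = rate * t
    return mu ** k * math.exp(-mu) / math.factorial(k)

rate, t = 0.5, 6.0    # assumed values, for illustration only
probs = [poisson_pmf(k, rate, t) for k in range(100)]
total = sum(probs)
expected = sum(k * p for k, p in enumerate(probs))
print(total, expected)   # close to 1 and to rate * t
```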

(v) We have seen that there exist ageless beings. What is even more
surprising is the existence of beings growing younger, e.g., machines
during their running-in time, when the probability that they will not
break down in a certain period increases with the passing of time. It can
easily be seen that mathematically this means that, using the notation of
(i), the ratio

$$\frac{f(x)}{1 - F(x)}$$

(the failure rate) is a decreasing function of x. The examination
of this rate is fundamentally important in reliability theory and in
storage problems.

(vi) Finally we mention a fascinating question related to human
mortality. Can the total number of people who ever lived on the Earth be
estimated by some probabilistic methods? The background of the following
surprising statement is explained in Goldberg’s book (see below):
“9 percent of everyone who ever lived is alive now.” This sentence was
also the title of an article in The New York Times (Oct. 6, 1981, p. 61).


9. THE PARADOX OF BERNOULLI’S LAW OF LARGE NUMBERS

a) The history of the paradox


There are few other laws in mathematics that have been misunderstood as
much as the laws of large numbers. (It is not even generally
known that several such laws exist.) The first law of large numbers was
proved by Jacob Bernoulli (1654–1705) in his book entitled "Ars
conjectandi" (Art of Guessing), which was published only after his death,
in 1713. Bernoulli himself did not use the term "law of large numbers";
it was introduced only by Poisson in 1837. According to Bernoulli's
law, if we toss a fair coin n times and it falls heads k times, then, as
the number of tosses n increases, the ratio k/n (the relative frequency of
heads) will approach the value 1/2. More precisely, if $\varepsilon$ and
$\delta$ are arbitrarily small positive numbers and n (depending on
$\varepsilon$ and $\delta$) is large enough, then $|k/n - 1/2|$ is less
than $\varepsilon$ with a probability of at least $1 - \delta$. This
theorem is not nearly as complicated as one might think from the number
of misunderstandings and paradoxes it has caused. The most typical is as
follows.


b) The paradox 


Gamblers often believe that, according to the law of large numbers, if a 
fair coin falls heads many times then the probability of tossing tails 
will necessarily increase. (Otherwise it would not be true that after a 
great many tosses the number of heads and tails are approximately the 
same.) On the other hand, it is obvious that coins cannot remember and 
so they do not know how many times they have already fallen tails or
heads. For this reason in every toss the chance of heads is 1/2 even if 
the coin has already fallen heads a thousand times in a row. Is this not 
in contradiction with Bernoulli's law? 


c) The explanation of the paradox 


According to Bernoulli's law, the number of heads and tails must be
approximately equal in the case of a great many tosses, but the
point is what is meant by "approximately". The gambler who believes
that the difference between the number of heads and tails must be very
small is mistaken, for Bernoulli's law only states that the ratio of the
number of heads to the total number of tosses is approximately 1/2 (with
a probability close to 1) or, equivalently, that the ratio of the number
of heads to the number of tails approximates 1; in other words, the
difference of their logarithms approximates 0 (provided that the number
of tosses increases). If the difference itself had to remain small, this
would contradict the lack of memory of coins.
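
The distinction between a small ratio and a small difference can be made concrete with an exact calculation. The sketch below (function name hypothetical) computes the expected absolute difference between the number of heads and tails in n tosses of a fair coin directly from the binomial distribution: the difference itself keeps growing, while its size relative to n shrinks.

```python
from fractions import Fraction
from math import comb

def expected_abs_difference(n):
    """Exact E|H - T| for n tosses of a fair coin (H heads, T = n - H tails)."""
    total = sum(abs(2 * h - n) * comb(n, h) for h in range(n + 1))
    return Fraction(total, 2 ** n)

for n in (100, 400, 1600):
    d = expected_abs_difference(n)
    print(n, float(d), float(d / n))   # difference grows, relative difference shrinks
```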


d) Remarks 


(i) It is clear now that no matter how many heads we observe in
succession, the chance of tails at the next toss will by no means be
greater. The following question now arises. Let us suppose that we toss
a coin n times. What is the longest run of heads we can expect?
If n = 100 then we can expect 6–7 heads in succession, if n = 1000 then
we can expect 9–10, and 19–20 for $n = 10^6$.
The following theorem was proved by Paul Erdős and Alfréd Rényi.
If we toss a coin n times then there occurs a "pure heads" run of length
$\log_2 n$ with a probability converging to 1 as $n \to \infty$. This fact
is very useful in deciding whether a sequence of two signs describes the
result of coin tosses or somebody has created it, "carefully" avoiding
long runs. Owing to the ingrained misunderstanding of Bernoulli's law of
large numbers, most people would not write the same sign consecutively 7
or more times in a sequence of 100 signs.
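
The run-length test described here is easy to try out. The following sketch simulates fair coin tosses with a fixed, arbitrarily chosen seed and compares the longest run of heads with $\log_2 n$; the helper name `longest_run` is hypothetical, and the simulated values vary with the seed.

```python
import math
import random

def longest_run(tosses, side="H"):
    """Length of the longest uninterrupted run of `side` in a sequence of tosses."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss == side else 0
        best = max(best, current)
    return best

rng = random.Random(0)   # fixed seed so that the experiment is repeatable
for n in (100, 1000, 10**6):
    tosses = rng.choices("HT", k=n)
    print(n, longest_run(tosses), round(math.log2(n), 2))
```

A hand-written "random" sequence of 100 signs can be passed to the same function; a longest run of only 3 or 4 is a hint that the sequence was not produced by a coin.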

(ii) According to the above remark, pure runs (of heads or tails) may be
rather long. On the other hand, it is easy to calculate that the expected
length of the 1st, 2nd, 3rd, etc. pure runs is always equal to 2. If your
coin is not fair and the probability of tossing heads is p, where
0 < p < 1 and q = 1 − p, then the expected length of the 1st, 3rd, ...
(every odd) pure run is

$$\frac{p}{q} + \frac{q}{p},$$

while the expected length of the 2nd, 4th, ... (every even) pure run is
always 2 (independently of p(!)). The sum $\frac{p}{q} + \frac{q}{p}$
cannot be less than 2, which means that the odd runs are on average at
least as long as the even runs. This fact is by no means surprising
because it is more probable for a coin to fall on its more probable side
first. Thus the first run has a greater chance of being long than short,
so on the average it is long, or at least longer than the second run,
where the expected length is only 2. What is surprising is that the
latter is independent of p.
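
These expected run lengths can be verified by summing the distributions of the run lengths directly. In the sketch below (function names hypothetical, series truncated after a fixed number of terms), the first run has length k with probability $p^k q + q^k p$ and the second run has length k with probability $p^2 q^{k-1} + q^2 p^{k-1}$, where q = 1 − p.

```python
def expected_first_run(p, terms=5000):
    """Truncated series for the expected length of the first pure run."""
    q = 1 - p
    return sum(k * (p**k * q + q**k * p) for k in range(1, terms))

def expected_second_run(p, terms=5000):
    """Truncated series for the expected length of the second pure run."""
    q = 1 - p
    return sum(k * (p**2 * q**(k - 1) + q**2 * p**(k - 1)) for k in range(1, terms))

for p in (0.5, 0.7, 0.9):
    q = 1 - p
    # columns: p, first run (series), p/q + q/p (closed form), second run (series)
    print(p, expected_first_run(p), p / q + q / p, expected_second_run(p))
```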

(iii) Bernoulli's law of large numbers can be expressed concisely with
the help of the notion of convergence in probability. We say that a
sequence of random variables $X_1, X_2, \ldots$ converges in probability
to a random variable X if the probability of $|X_n - X| > \varepsilon$
converges to 0 for every positive $\varepsilon$, i.e., if
$P(|X_n - X| > \varepsilon) \to 0$. (Paradoxically, it may occur that a
sequence of random variables $X_1, X_2, \ldots$ converges in probability
to 0 but

$$\frac{X_1 + X_2 + \ldots + X_n}{n}$$

does not.) Bernoulli's law says that the relative frequency k/n of an event
converges in probability to its probability. To prove convergence in
probability we generally use the Cebyshev–Bienaymé inequality, according
to which if the expected value of X is E and its variance is $D^2$, then
for every positive $\varepsilon$

$$P(|X - E| \ge \varepsilon) \le \frac{D^2}{\varepsilon^2}.$$

It is interesting that the Russian P. L. Cebyshev and the French
J. Bienaymé published their inequality, which they had discovered
independently, in the very same issue of a French journal (J. Math. Pures
et Appl. XII, 1867). From this inequality it follows at once that if the
independent random variables $X_1, X_2, \ldots$ have the same distribution
and their variance $D^2$ is finite, then the arithmetical mean

$$\frac{X_1 + X_2 + \ldots + X_n}{n}$$

converges in probability to the common expected value of the $X_i$ (the
variance of this arithmetical mean is $D^2/n$, which converges to 0 as
$n \to \infty$).

This is one of the general (weak) laws of large numbers. The weak laws
of large numbers examine convergence in probability while strong laws
describe convergence with probability 1. The next remark concerns the
latter.
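
Applied to coin tossing, the inequality gives an explicit (if generous) number of tosses for Bernoulli's law. The relative frequency k/n has expected value 1/2 and variance 1/(4n), so $P(|k/n - 1/2| \ge \varepsilon) \le 1/(4n\varepsilon^2)$, which is at most $\delta$ as soon as $n \ge 1/(4\varepsilon^2\delta)$. The sketch below, with a hypothetical function name, evaluates this bound using exact fractions.

```python
from fractions import Fraction
from math import ceil

def tosses_sufficient(eps, delta):
    """Smallest n with 1 / (4 n eps^2) <= delta (a sufficient, not a necessary, number)."""
    return ceil(1 / (4 * Fraction(eps) ** 2 * Fraction(delta)))

# eps = 0.01, delta = 0.05
print(tosses_sufficient(Fraction(1, 100), Fraction(1, 20)))   # 50000
```

The bound is far from sharp: the normal approximation discussed in the next paradox shows that considerably fewer tosses are enough for the same accuracy.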

(iv) Among strong laws of large numbers the best known is Kolmogorov's
theorem: if $X_1, X_2, \ldots$ are mutually independent random variables
with the same probability distribution function having a (common)
finite expected value E, then the arithmetical mean

$$\frac{X_1 + X_2 + \ldots + X_n}{n}$$

converges to E as $n \to \infty$ with probability 1. If the random
variables are positive and $S_n^{(k)}$ denotes the elementary symmetric
polynomial

$$S_n^{(k)} = \sum_{1 \le i_1 < \ldots < i_k \le n} X_{i_1} X_{i_2} \cdots X_{i_k},$$

then

$$\left( S_n^{(k)} \Big/ \binom{n}{k} \right)^{1/k}$$

also converges to E (with probability 1) provided k = k(n) is a natural
number such that $k/n \to 0$ as $n \to \infty$. The limit exists with
probability 1 even if k/n tends to a positive number c. Then this limit is
a constant depending on c (if 0 < c < 1 then it is enough to suppose that
the expected value of $\log(1 + X_i)$ is finite, and the same holds for
$|\log X_i|$ if c = 1; see the paper by Halász and Székely). If the random
variables can take both positive and negative values then the problem is
more complicated.

(v) The "law of large numbers" is false in the sense of category (see
the paper by Méndez).


10. DE MOIVRE'S PARADOX; ENERGY SAVING

a) The history of the paradox


One of the most outstanding figures of probability theory is Abraham de
Moivre (1667–1754). He was a mathematician of French origin, but
after the revocation of the Edict of Nantes (which had guaranteed the
Huguenots' freedom of religion) he moved to England. His fundamental work
"The Doctrine of Chances" was published there in 1718. In the third
edition of the book (1756), de Moivre himself writes enthusiastically
about his epoch-making discovery (already communicated to some of
his friends in 1733), which proves much more than Bernoulli's law of
large numbers: "... I'll take the liberty to say, that this is the hardest
problem that can be posed on the subject of Chance...". There is no
doubt that de Moivre's discovery, the normal distribution, has become a
pillar of the science of chance. (Curiously enough, de Moivre did not
incorporate it in the second edition (1738).)


b) The paradox


According to Bernoulli's law of large numbers, in the coin tossing game the
probability that the number of heads the player scores is approximately
equal to the number of tails tends to 1 as the number of tosses increases
(approximate equality means that the ratio of the two numbers tends
to 1). On the other hand, the probability that the number of heads is
exactly equal to the number of tails tends to zero. For example, in 6
tosses of a coin the probability of scoring 3 heads is 5/16; in 100 tosses
the probability of scoring 50 heads is 8%; in 1000 tosses the probability
of scoring 500 heads is about 2.5%. Generally, when tossing 2n times,
the probability that the coin falls heads exactly n times is

$$p = \binom{2n}{n} \Big/ 2^{2n}$$

and, for sufficiently large n, p is approximately $1/\sqrt{\pi n}$, which
really tends to zero as n increases. In sum: the probability that the
number of heads approximately equals the number of tails tends to one,
whereas the probability that the number of heads is exactly equal to the
number of tails tends to zero. The gulf between the two facts was
surrounded by a "paradoxical atmosphere" till de Moivre succeeded in
building a mathematical bridge over the gulf.
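
The quoted probabilities and the approximation $1/\sqrt{\pi n}$ can be compared directly. The sketch below (function name hypothetical) computes the exact probability of n heads in 2n tosses with integer arithmetic and prints it next to the approximation.

```python
from fractions import Fraction
from math import comb, pi, sqrt

def prob_equal_heads_tails(n):
    """Exact probability of exactly n heads in 2n tosses of a fair coin."""
    return Fraction(comb(2 * n, n), 4 ** n)

for n in (3, 50, 500):
    exact = prob_equal_heads_tails(n)
    # columns: number of tosses, exact probability, approximation 1/sqrt(pi n)
    print(2 * n, float(exact), 1 / sqrt(pi * n))
```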


c) The explanation of the paradox 


Let $H_n$ and $T_n$ denote the number of heads and tails, respectively, in n
tosses of a coin. According to Bernoulli's law of large numbers, the
probability that $H_n - T_n$ becomes negligibly small compared to n tends
to one (which is not at all surprising). De Moivre, however, noticed that
the term $|H_n - T_n|$ is not negligible compared to $\sqrt{n}$. He
calculated, for example, for n = 3600 that the probability that
$|H_n - T_n|$ is at most 60 is 0.682688.... Let x be an arbitrary positive
number and let $A_n(x)$ denote the probability that
$|H_n - T_n| < x\sqrt{n}$. According to de Moivre, $A_n(x)$ tends to a
value A(x) which is between 0 and 1 as n increases. When x increases from
zero to infinity, A(x) also increases steadily from zero to one (see
Remark (i)). This function A(x) is the above-mentioned bridge over the
gulf. To determine A(x), de Moivre used Stirling's formula, which he had
also discovered independently of Stirling. (James Stirling's formula was
proved in 1730 and states that n! is asymptotically equal to
$\sqrt{2\pi n}\,(n/e)^n$.)


d) Remarks 


(i) The exact form of the function A(x) is:

$$A(x) = \sqrt{\frac{2}{\pi}} \int_0^x e^{-u^2/2}\,du.$$

Using this formula, de Moivre's theorem can be written in the following
form:

$$\lim_{n \to \infty} P(|H_n - T_n| < x\sqrt{n}) = A(x), \quad \text{if } x > 0,$$

or

$$\lim_{n \to \infty} P(H_n - T_n < x\sqrt{n}) = \Phi(x),$$

where

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-u^2/2}\,du.$$

A random variable which assumes values smaller than x with probability
$\Phi(x)$ (where x is an arbitrary real value) is said to obey a standard
normal distribution. According to de Moivre's result,
$(H_n - T_n)/\sqrt{n}$ approximately obeys a standard normal distribution
(if n is large enough).

[Table 1 at the back of the book gives the values of $\Phi(x)$.]
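
Both functions can be evaluated with the error function, since $A(x) = \mathrm{erf}(x/\sqrt{2})$ and $\Phi(x) = (1 + \mathrm{erf}(x/\sqrt{2}))/2$. The sketch below (function names hypothetical) compares A(1) with the exact binomial probability that $|H_n - T_n| \le 60$ for n = 3600. The exact value counts the boundary cases as well, so it should be close to, but not identical with, the limiting value.

```python
from fractions import Fraction
from math import comb, erf, sqrt

def A(x):
    """Limiting probability that |H_n - T_n| < x * sqrt(n)."""
    return erf(x / sqrt(2))

def Phi(x):
    """Standard normal distribution function."""
    return (1 + erf(x / sqrt(2))) / 2

def exact_abs_difference_at_most(n, d):
    """Exact P(|H_n - T_n| <= d) for n tosses of a fair coin."""
    favourable = sum(comb(n, h) for h in range(n + 1) if abs(2 * h - n) <= d)
    return Fraction(favourable, 2 ** n)

print(A(1.0), Phi(3.0))
print(float(exact_abs_difference_at_most(3600, 60)))
```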

(ii) Since $H_n + T_n = n$, the above result can be reformulated in the
following way:

$$\lim_{n \to \infty} P\left(H_n < \frac{n}{2} + \frac{x\sqrt{n}}{2}\right) = \Phi(x).$$

De Moivre also examined the case where the coin was biased (not
fair): it fell heads with probability p and tails with probability
1 − p. Then

$$\lim_{n \to \infty} P\left(H_n < np + x\sqrt{np(1-p)}\right) = \Phi(x),$$

which is known as the "de Moivre–Laplace limit theorem". This theorem
can be widely applied in a whole range of planning problems, e.g., in
energy planning.


Figure 3. The standard normal distribution function $\Phi(x)$, plotted for x from −3 to 3.


Figure 4. The normal density function. The area within one standard deviation on each side of the expectation is 34.13%, i.e., 68.26% in total.


Example: Consider 300 similar machines in a factory. If, on average,
70% of the machines work and 30% of them are under repair, then power
has to be provided for 210 machines, on average. Sometimes, however,
all the 300 machines may be working. How much power has to be provided
to be 99.9% sure that every working machine will have enough power?
(It is assumed that the machines break down independently of each other.)
In the above formula, $H_n$ now stands for the number of working
machines, n = 300 and p = 0.7. According to Table 1, $\Phi(x) \approx 0.999$
if x = 3. Using these values,
$np + x\sqrt{np(1-p)} = 210 + 3\sqrt{63} \approx 233.8$,
so it is enough to take into account 234 machines. (In practice, however,
nearly all the 300 machines are taken into account, which is unnecessarily
overcautious.)
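
The arithmetic of this example is short enough to script. The sketch below uses a hypothetical function name and takes x = 3 from the example; it evaluates the bound $np + x\sqrt{np(1-p)}$ and rounds it up to a whole number of machines.

```python
from math import ceil, sqrt

def machines_to_supply(n, p, x):
    """np + x * sqrt(np(1 - p)), rounded up to a whole number of machines."""
    return ceil(n * p + x * sqrt(n * p * (1 - p)))

print(machines_to_supply(300, 0.7, 3))   # 210 + 3 * sqrt(63), rounded up: 234
```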

(iii) The de Moivre–Laplace theorem, discussed above, can be generalized
in many ways. The members of the St. Petersburg mathematical
school, led by P. L. Cebyshev (1821–1894), especially A. M. Liapunov
(1857–1918) and A. A. Markov (1856–1922), gained great distinction
for generalizing the de Moivre–Laplace theorem. Let $X_1, X_2, \ldots$ be
mutually independent random variables with a common distribution
(i.e., with a common distribution function). Suppose that the expectation
and the standard deviation of the random variables exist and are finite.

Let M denote the common expectation and D the standard deviation
of the random variables $X_1, X_2, \ldots$ and let
$S_n = X_1 + X_2 + \ldots + X_n$. Then

$$\lim_{n \to \infty} P(S_n < nM + xD\sqrt{n}) = \Phi(x).$$

This is the central limit theorem, the most important of all limit
theorems (it is because of its importance that it is called "central", a
name first used by George Pólya). In general, limit theorems discuss
the asymptotic distributions of different functions (e.g., the sum, product,
maximum, etc.) of random components. The central limit theorem (and its
generalizations) explains why we meet the normal distribution
in nature so often, especially in connection with quantities which can be
composed from many ("nearly") identically distributed ("nearly")
independent random components. However, it is worth emphasizing that the
"composition" of random variables in nature is not always their sum, so
the investigation of the behaviour of other functions of random
variables is very important. The known limit theorems do not completely
explain the frequent occurrence of the normal distribution. According to
Poincaré's sarcastic remark, everybody believes in the universality of
the normal distribution: physicists believe in it because they think that
mathematicians have proved its logical necessity, and mathematicians
believe in it because they think that physicists have verified it by
laboratory experiments.


11. BERTRAND'S PARADOX


a) The history of the paradox


Georges Buffon (1707–1788), the famous French scientist, founded a
new branch of probability theory with a paper written in 1733 (but
published in 1777). The solution of the celebrated "needle problem"
discussed in this paper required a geometrical (rather than combinatorial)
method. In this sort of problem the random points considered are
supposed to be uniformly distributed in a given domain (e.g., the
bullet holes on a target). The probability of falling into any part of a
given domain is proportional to its area (length or volume). Thus to
calculate the probability we only have to compute the quotient of the
"favourable" and the "total" area (length or volume). These kinds of
probabilities also resulted in several paradoxes. E.g., the chance of hitting
the very middle (or any other fixed point) of a target is obviously 0.
On the other hand, it is not impossible to hit this point and therefore we
must distinguish an event of probability 0 from the impossible event
(the probability of the impossible event is 0 but the converse is not true).
It also sounds very strange that both hitting at least one of finitely many
points and hitting only one of them have the same probability. (Both
probabilities are equal to 0. See the paradox of zero probability.) Another
curiosity: a one-to-one transformation may completely change the
chances. E.g., if we choose a point in (0, 1) at random then the chance
that the chosen number is less than 1/2 is 50%, while if all the numbers in
(0, 1) are squared and we choose from among these squares uniformly,
the chance that the chosen square belongs to a number less than 1/2
(i.e., that the square is less than 1/4) is only 25%. Of course, the first
answer, i.e., 50%, is more reasonable. However, in other problems it might
be more difficult to choose between reasonable and unreasonable. We have
already mentioned (in the last remark to the first paradox) that such a
choice is not always possible on the basis of pure logic excluding
experience. Exactly this is the essence of the following paradox,
published in the book "Calcul des probabilités" (1889) by Joseph Louis
Bertrand.


b) The paradox 


Choose a random chord of a given circle and calculate the probability 
that this chord is longer than the side of the equilateral triangle inscribed 
in the circle. The paradox claims that this probability is not determined 
uniquely, i.e., different methods lead to different results. 


First method: 

Choose a point at random, uniformly in the given circle. This random 
point determines a unique chord whose midpoint is the randomly chosen 
point. This chord is longer than the side of our equilateral triangle if 
and only if the point is interior to the inscribed circle of the triangle. 
The radius of this circle is half of the original one, that is, its area is 1/4 
of the other. Consequently the probability that the randomly chosen 
point is in the inside of the inscribed circle is 1/4. So this method gives 
the answer 1/4. 


Second method: 

Due to symmetry, one end of the chord can be any fixed point on the 
circumference of the circle. So fix it to a vertex of the inscribed triangle. 
Choose the other end at random with uniform distribution. The vertices 
of the triangle divide the circumference of the circle into 3 equal arcs 
and the random chord is longer than the side of the equilateral triangle
if the random chord intersects the triangle. So the probability in question 
is now 1/3. 


Figure 5. Three ways for choosing the random chord. 


Third method: 

Choose a point at random, uniformly on a radius of the circle and take 
the chord which is perpendicular to the radius at this point. Then the 
random chord is longer than the side of the inscribed equilateral triangle 
if the random point belongs to the half of the radius which is closer to the 
centre. Due to symmetry, it does not matter which radius was originally 
chosen, therefore the probability is 1/2. 
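
The three answers can be reproduced by simulation. The following Monte Carlo sketch works with a circle of radius 1, for which the side of the inscribed equilateral triangle is $\sqrt{3}$; the function names, the number of trials and the seed are arbitrary choices made for this illustration, and the estimates fluctuate slightly with the seed.

```python
import math
import random

SIDE = math.sqrt(3)   # side of the equilateral triangle inscribed in a unit circle

def chord_first_method(rng):
    """Midpoint chosen uniformly inside the circle (rejection sampling)."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        d2 = x * x + y * y
        if d2 <= 1:
            return 2 * math.sqrt(1 - d2)

def chord_second_method(rng):
    """One end fixed, the other end chosen uniformly on the circumference."""
    angle = rng.uniform(0, 2 * math.pi)   # central angle between the two ends
    return 2 * math.sin(angle / 2)

def chord_third_method(rng):
    """Chord perpendicular to a radius at a point chosen uniformly on that radius."""
    d = rng.uniform(0, 1)
    return 2 * math.sqrt(1 - d * d)

def estimate(method, trials=200_000, seed=1):
    """Fraction of simulated chords that are longer than the side of the triangle."""
    rng = random.Random(seed)
    return sum(method(rng) > SIDE for _ in range(trials)) / trials

for method in (chord_first_method, chord_second_method, chord_third_method):
    print(method.__name__, estimate(method))   # near 1/4, 1/3 and 1/2
```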


c) The explanation of the paradox 


The different results were considered a paradox since it was believed that
"the uniform random choice" uniquely determines the probability in
question. The paradox points out that there can be different uniform
choices, all of which are "natural" in a sense. Each of the 3 methods
above uses a uniform distribution (in the circle, on the circumference of
the circle, and on a radius of the circle). In Poincaré's opinion (Calcul
des probabilités, Paris, 1912), if we do not have any preliminary
information then we should accept the third method (where the result is
1/2) because this is the method which assures that if two sets of chords
are geometrically congruent then a randomly chosen chord belongs to one
set or the other with the same probability. The study of this
kind of invariance led to a very interesting branch of mathematics called
integral geometry. (This term was coined by Wilhelm Blaschke in 1934.)
The following invariance requirements also lead to probability 1/2
(see Jaynes, E. T., "The Well-Posed Problem", Foundations of Physics, 3,
477–493 (1973)). Let the circle have radius R. The position of the chord
is determined by giving the polar coordinates $(r, \vartheta)$ of its
center. We seek to answer a more detailed question than Bertrand's: What
probability density
$f(r, \vartheta)\,dA = f(r, \vartheta)\,r\,dr\,d\vartheta$ should we assign
over the interior area of the circle? Since the distribution of chord
length depends only on the radial distribution,
$f(r, \vartheta) = f(r)$. Thus the problem is reduced to determining a
function f(r), normalized according to

$$\int_0^{2\pi} \int_0^R f(r)\,r\,dr\,d\vartheta = 1.$$

The scale invariance (i.e., the invariance under the change of scale)
leads to the equation

$$a^2 f(ar) = 2\pi f(r) \int_0^{aR} f(u)\,u\,du, \quad 0 < a \le 1, \quad 0 \le r \le R.$$

Differentiating with respect to a, setting a = 1, and solving the resulting
differential equation, we find that the most general solution (satisfying
the above-mentioned normalizing condition) is

$$f(r) = \frac{q\,r^{q-2}}{2\pi R^q},$$

where q is a constant in the range $0 < q < \infty$, not further
determined by scale invariance. Finally, if we translate the circle by a
distance b, the transformation
$(r, \vartheta) \to (r', \vartheta')$ is given by
$r' = |r - b\cos\vartheta|$ and

$$\vartheta' = \begin{cases} \vartheta, & \text{if } r > b\cos\vartheta, \\ \vartheta + \pi, & \text{if } r < b\cos\vartheta. \end{cases}$$

The translational invariance gives q = 1.
Thus we get

$$f(r, \vartheta) = \frac{1}{2\pi R r}, \quad 0 \le r \le R, \quad 0 \le \vartheta \le 2\pi,$$

corresponding to the third method. Since a chord whose midpoint is at
$(r, \vartheta)$ has a length $L = 2(R^2 - r^2)^{1/2}$, the probability
density function of x = L/(2R) is

$$\frac{x}{(1 - x^2)^{1/2}}, \quad 0 \le x < 1,$$

in agreement with Borel's conjecture (Éléments de la théorie des
probabilités, Paris, 1909).
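
The step from the scale-invariance equation $a^2 f(ar) = 2\pi f(r) \int_0^{aR} f(u)\,u\,du$ to its general solution is only stated in the text. One way to fill it in is sketched below; the abbreviation $q = 2\pi R^2 f(R)$ and the constant C are introduced here for convenience and are assumptions of this sketch, not notation taken from the source.

```latex
% Differentiate both sides of the scale-invariance equation with respect to a:
2a\,f(ar) + a^2 r\,f'(ar) = 2\pi f(r)\,f(aR)\,a R^2 .

% Set a = 1 and abbreviate q = 2\pi R^2 f(R):
r\,f'(r) = (q - 2)\,f(r)
\quad\Longrightarrow\quad
f(r) = C\,r^{q-2} .

% The normalizing condition (which requires q > 0) fixes the constant C:
2\pi C \int_0^R r^{q-1}\,dr = \frac{2\pi C R^q}{q} = 1
\quad\Longrightarrow\quad
f(r) = \frac{q\,r^{q-2}}{2\pi R^q} .
```

As a consistency check, this solution gives $f(R) = q/(2\pi R^2)$, in agreement with the abbreviation used for q.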


d) Remarks 


(i) In discussing Bertrand's paradox, we have dealt with three methods
of choosing a random chord but there exist many other natural methods
as well. E.g., if we pick a point in the given circle at random and draw
a chord of random direction through the chosen point (the direction is
uniformly distributed in the whole angular domain and independent of the
choice of the point) then the probability in question is

$$\frac{1}{3} + \frac{\sqrt{3}}{2\pi} = 0.609\ldots$$

It is not surprising that the result is more than 1/2 because this kind
of selection prefers the longer chords. The probability is even greater
(0.7449) if the random chord connects two random points of the circle.
Another, less natural, method of choice is the following. Draw a circle
(of radius r) concentric with the given circle (of radius R) and choose
a random point (uniformly distributed) in the circle of radius r. Draw
a line through this chosen point with a direction uniformly distributed
in the whole angular domain and independent of the chosen point. Now
the question is the following. If the line intersects the circle of radius R,
what is the probability that the chord which it cuts out of the circle of
radius R is longer than the side of the inscribed equilateral triangle?
The answer comes easily. If r is increased gradually from r = R/2 to
r = ∞, the probability in question decreases from 1 to 1/2.

(ii) The integral geometry developed from geometric probability has
an increasing importance in many fields, e.g., in stereology, that is, in
the reconstruction of 3-dimensional forms from their 2-dimensional sections
or projections. Stereology is usefully employed in mineralogy, metallurgy,
and biology (especially in tomography, in the 3-dimensional reconstruction
of tumours).


12. A PARADOX OF GAME THEORY.
THE GLADIATOR PARADOX


a) The history of the paradox


Though gambling has flourished in various forms from the time of
Paleolithic man and the mathematical study of various games goes back
to the Renaissance, it was only in the 20th century that the general theory
of games (and its connection with other sciences like economics) evolved.
In 1921 a mathematical theory of game strategies was first attempted by
Émile Borel, but it was John von Neumann, the father of game theory, who
proved the minimax principle, the fundamental theorem of game theory,
in 1928. (Earlier even Borel had doubted its validity.)

The following paradox helps to understand the essence of the minimax
theorem.


b) The paradox


Two children, R and Q, play the following well-known children's game.
They both put up one or two of their fingers at the same time; if the
number of fingers they raised altogether is even, Q pays R, and if it is
odd, R pays Q, in both cases the same amount as the number of fingers
raised altogether. The table (payoff matrix) below indicates the sum of
money Q has to give to R. Though this game is generally considered to be
fair [perhaps because the numbers shown in the table add up to 0:
2 + (−3) + (−3) + 4 = 0], it is not fair at all: it is definitely
favourable for Q.

                      Q: one finger    Q: two fingers
    R: one finger           2               −3
    R: two fingers         −3                4

Figure 6.


c) The explanation of the paradox 


Obviously, if one of the players always puts up one finger or always two,
then, having observed this, his opponent can play so that he always
wins. Therefore only a "mixed strategy" can be advantageous, that is,
in each trial the player has to choose at random, but with fixed
probabilities, from the two possibilities (one or two of his fingers). Let
us suppose that we have already determined the optimal strategies of both
players, i.e., we know that the best strategy for R is to put up one
finger with probability $p_1$ and two fingers with probability $p_2$
(clearly $p_1 + p_2 = 1$), and similarly for Q the most favourable is to
put up one finger with probability $q_1$ and two fingers with probability
$q_2$ ($q_1 + q_2 = 1$). Since the two players decide independently of
each other, the average amount of money Q pays R (if both players have
chosen the optimal strategy) is

$$V = 2p_1q_1 - 3p_1q_2 - 3p_2q_1 + 4p_2q_2. \qquad (*)$$
The game would be fair if V = 0. We shall show, however, that
$p_1 = q_1 = 7/12$, $p_2 = q_2 = 5/12$ and then $V = -1/12$, which means
that Q wins, on the average, 1/12 dollars in each trial even if R follows
his optimal strategy.

Substitute $q_1 = 1$, $q_2 = 0$ in (*). Then $V = Q_1 = 2p_1 - 3p_2$.
Similarly, if $q_1 = 0$ and $q_2 = 1$, then $V = Q_2 = -3p_1 + 4p_2$.
Using this notation, $V = q_1Q_1 + q_2Q_2$. As V is the average loss of Q
if he follows his optimal strategy, $Q_1 \ge V$ and $Q_2 \ge V$, hence
$V = q_1Q_1 + q_2Q_2 \ge q_1V + q_2V = (q_1 + q_2)V = V$.

Since neither $q_1$ nor $q_2$ can be equal to zero, it follows from the
above relation that $V = Q_1 = Q_2$, i.e.,
$2p_1 - 3p_2 = -3p_1 + 4p_2$, so (using $p_1 + p_2 = 1$) $p_1 = 7/12$,
$p_2 = 5/12$ and $V = -1/12$. Similarly,
$2q_1 - 3q_2 = -3q_1 + 4q_2$ (with $q_1 + q_2 = 1$), and consequently
$q_1 = 7/12$, $q_2 = 5/12$.
Thus we have proved that the game is certainly not fair, and we have
also obtained the optimal strategy. For both players it is advantageous
to raise one finger with probability 7/12.
Substituting $1 - p_1$ for $p_2$ and $1 - q_1$ for $q_2$ in the formula
(*): $V = 12p_1q_1 - 7p_1 - 7q_1 + 4$. For $p_1 = 7/12$: $V = -1/12$
regardless of the value of $q_1$; similarly, for $q_1 = 7/12$:
$V = -1/12$ regardless of the value of $p_1$.
Accordingly, it makes no difference to a player how he plays if he knows
that his opponent has chosen his optimal strategy.
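
The calculation above can be repeated with exact fractions. The sketch below is a minimal illustration, not a general game solver: it assumes a 2×2 zero-sum game without a saddle point, so that both optimal strategies are the "equalizing" ones derived in the text. The function names are hypothetical.

```python
from fractions import Fraction

def expected_payment(p1, q1, payoff):
    """Average amount Q pays R when R raises one finger with probability p1 and Q with q1."""
    (a, b), (c, d) = payoff
    p2, q2 = 1 - p1, 1 - q1
    return a * p1 * q1 + b * p1 * q2 + c * p2 * q1 + d * p2 * q2

def solve_2x2(payoff):
    """Equalizing mixed strategies of a 2x2 zero-sum game without a saddle point."""
    (a, b), (c, d) = payoff
    denom = a - b - c + d
    p1 = Fraction(d - c, denom)   # makes R's average gain the same for both replies of Q
    q1 = Fraction(d - b, denom)   # makes Q's average loss the same for both replies of R
    return p1, q1, expected_payment(p1, q1, payoff)

payoff = ((2, -3), (-3, 4))       # rows: R raises 1 or 2 fingers; columns: Q raises 1 or 2
p1, q1, value = solve_2x2(payoff)
print(p1, q1, value)              # 7/12 7/12 -1/12
```

Calling `expected_payment` with $p_1 = 7/12$ and any value of $q_1$ returns −1/12, which illustrates the final observation that a player's own choice no longer matters once the opponent plays optimally.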


d) Remarks 


(i) The principal aim of Neumann's research in game theory was to 
find the optimal strategy of a game in which m players take part. We 
assume for simplicity that m=2 (i.e., only two players play against each 
other) and that the game has the zero-sum property (i.e., the loss of 
the first player is equal to the gain of the second player). Let $S_1$ and $S_2$
denote the sets of the first and the second player's pure strategies,
respectively (a pure strategy is a rule which determines the player's
first step and the replies to all the possible steps of the opponent). Let
$L(s_1, s_2)$ be a bivariate function which gives the loss of the second player
if he follows the pure strategy $s_2\in S_2$ and the first player follows
$s_1\in S_1$ (the table on page 49 shows a function of this type). For a
cautious player the best strategy is the one which minimizes his maximum
loss (which occurs at optimal defence). The first player can manage to win
a gain

$$V_1 = \max_{s_1}\min_{s_2} L(s_1, s_2)$$

anyway, and the second player can keep his loss down to

$$V_2 = \min_{s_2}\max_{s_1} L(s_1, s_2).$$


(Naturally $V_1$ or $V_2$ may also take negative values indicating actual loss.)
In the case where $V_1=V_2$ and the sets of possible strategies are finite, it
is useful for both players to choose the strategies $s_1^*, s_2^*$ for which
$V_1=V_2=L(s_1^*, s_2^*)$. A strategy-pair such as $(s_1^*, s_2^*)$ is the
saddle-point of the game, but it does not always exist. Neumann, however,
had the brilliant idea of extending the set of possible strategies and
introduced "mixed strategies" which choose randomly from pure strategies.
Thus a mixed strategy is a probability distribution on the set of pure
strategies. (In the children's game example the mixed strategies of the two
players were $p_1, p_2$ and $q_1, q_2$, respectively.) Mixed strategies
eliminate the possibility of a player "seeing through his opponent", but
they introduce chance, even in games where the rules themselves do not
depend on chance. Naturally, if we want to find the optimal mixed strategies
then we have to define the loss function on the set of pairs $(\pi_1, \pi_2)$
of mixed strategies. Let $L(\pi_1, \pi_2)$ be the average loss that the second
player pays the first one if they choose the mixed strategies $\pi_1\in P_1$
and $\pi_2\in P_2$. Neumann's minimax theorem (the fundamental theorem of game
theory) states that if $S_1$ and $S_2$ are finite sets, then

$$\max_{\pi_1\in P_1}\min_{\pi_2\in P_2} L(\pi_1, \pi_2) = \min_{\pi_2\in P_2}\max_{\pi_1\in P_1} L(\pi_1, \pi_2),$$

i.e., there always exists a saddle-point in mixed strategies, thus optimal
mixed strategies exist for both players.
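
To see why pure strategies alone may fail to give a saddle-point, one can compute the two pure-strategy values for a small loss table. This is a sketch only: the first table uses the coefficients of the children's game, the second table is a hypothetical one added purely for contrast, and the function name is illustrative.

```python
def pure_strategy_values(loss):
    """Return (V1, V2) for a loss table whose rows are the first player's
    pure strategies and whose columns are the second player's."""
    v1 = max(min(row) for row in loss)                       # max-min gain
    columns = range(len(loss[0]))
    v2 = min(max(row[j] for row in loss) for j in columns)   # min-max loss
    return v1, v2


finger_game = [[2, -3], [-3, 4]]   # coefficients of the children's game
with_saddle = [[3, 1], [0, -2]]    # hypothetical table, for contrast only

for name, table in (("finger game", finger_game), ("with saddle", with_saddle)):
    v1, v2 = pure_strategy_values(table)
    verdict = "saddle-point exists" if v1 == v2 else "no saddle-point"
    print(name, v1, v2, verdict)
```

For the children's game the two values differ, so no pair of pure strategies is stable, and the common value is reached only after passing to mixed strategies.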

The general model of game theory can be used to examine conflicts 
appearing in other fields of life, too. E.g., from a mathematical point 
of view commercial competition can be considered a “game” in which 
both players want to find their optimal strategies. Since it is less and 
less likely that rivals could swindle each other permanently, compromises 
(corresponding to saddle points) are becoming more and more important 
in many fields. Game theory brought a new aspect into mathematical 
statistics, too, mainly due to Abraham Wald. The following remark 
shows a few applications of game theory in statistics. 

(ii) A typical problem in statistics is the estimation of the unknown
parameter $\vartheta\in\Theta$ of a probability distribution $F_\vartheta$, on
the basis of the (usually independent) $F_\vartheta$ distributed observations
$X_1, X_2, \dots, X_n$, i.e., the sample ($\Theta$ is an arbitrary set usually
consisting of numbers or vectors). Consider a bivariate function
$L(\vartheta, c)$ the values of which mean our loss in the case where we
estimate the unknown parameter $\vartheta$ with the value $c$. It is quite
natural to assume that the greater the deviation $|\vartheta-c|$ is, the
greater the loss becomes, thus $L(\vartheta, c)$ is typically a monotone
increasing function of $|\vartheta-c|$, for example,
$L(\vartheta, c)=|\vartheta-c|^{\delta}$, where $\delta>0$. An estimator
$\hat\vartheta=f(X_1, X_2, \dots, X_n)$ is good if the average loss is small,
i.e., if the risk function
$R(\vartheta, \hat\vartheta)=E(L(\vartheta, \hat\vartheta))$ is small.
Comparing two estimators, however, the value of the risk function of
the first estimator at particular values of $\vartheta$ can be smaller than
that of the second estimator, while at other values of $\vartheta$ the
situation is quite the opposite. A comparatively wide range of estimators
have a risk function which can be decreased at some values of $\vartheta$ only
if at other values of $\vartheta$ it increases. These kinds of estimators are
called admissible estimators, that is, an estimator $\hat\vartheta_0$ is
admissible if the inequality
$R(\vartheta, \hat\vartheta)\le R(\vartheta, \hat\vartheta_0)$ holds for all
$\vartheta\in\Theta$ if and only if
$R(\vartheta, \hat\vartheta)=R(\vartheta, \hat\vartheta_0)$ for all
$\vartheta\in\Theta$. Only admissible estimators are worth using because for a
non-admissible estimator we can always find another estimator the risk
function of which is nowhere larger and definitely smaller at certain
points than the risk function of the non-admissible estimator. If we want
to find an admissible estimator which minimizes the average loss at the
"worst" actual parameter value (where the risk function takes its
maximum), then we obtain the minimax estimator. This cautious estimator
is defined as follows: an estimator $\vartheta^*$ is called minimax if

$$\sup_{\vartheta\in\Theta} R(\vartheta, \vartheta^*) = \inf_{\hat\vartheta}\,\sup_{\vartheta\in\Theta} R(\vartheta, \hat\vartheta),$$

where $\hat\vartheta$ runs through all the possible estimators. The
"minimax-aspect" of mathematical statistics considers estimation a game in
which Nature "chooses" a parameter $\vartheta$ and we choose an estimator
$\hat\vartheta$. The aim of the game is to make the average loss as small as
possible. The average loss can be made less by allowing mixed strategies,
when Nature chooses the parameter $\vartheta$ randomly from $\Theta$ with
distribution $\tau$ and we also choose the estimator randomly with
distribution $\alpha$ from the set of all possible estimators. In this case
the risk function is $r(\tau, \alpha)=E(R(T, A))$, where $T$ is a $\tau$
distributed random variable on the parameter set $\Theta$ and $A$ has
the distribution $\alpha$ on the set of all possible strategies. The minimax
theorem remains valid for risk functions of this kind under quite
general conditions:

$$\sup_{\tau}\inf_{\alpha} r(\tau, \alpha) = \inf_{\alpha}\sup_{\tau} r(\tau, \alpha).$$

Since the distribution $\tau$ is unknown, it is useful to choose a mixed
minimax strategy $\alpha^*$ as an estimator for which the equation

$$\sup_{\tau} r(\tau, \alpha^*) = \inf_{\alpha}\sup_{\tau} r(\tau, \alpha)$$

holds.
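
A concrete instance may help to fix the idea of comparing estimators by their worst-case risk. The sketch below is an assumption-laden illustration, not an example taken from the text: it uses the binomial model with squared-error loss, and compares the sample mean X/n with the classical textbook minimax estimator (X + sqrt(n)/2)/(n + sqrt(n)). All function names are illustrative.

```python
from math import comb, sqrt


def risk(estimator, n, theta):
    """Expected squared error of `estimator` when X is Binomial(n, theta)."""
    return sum(
        comb(n, x) * theta ** x * (1 - theta) ** (n - x)
        * (estimator(x, n) - theta) ** 2
        for x in range(n + 1)
    )


def sample_mean(x, n):
    return x / n


def shrunk(x, n):
    # Classical minimax estimator for this model (an assumption of the sketch).
    return (x + sqrt(n) / 2) / (n + sqrt(n))


n = 25
grid = [i / 100 for i in range(101)]
risk_mean = [risk(sample_mean, n, t) for t in grid]
risk_shrunk = [risk(shrunk, n, t) for t in grid]

print("largest risk of X/n:        ", max(risk_mean))
print("largest risk of the shrunk: ", max(risk_shrunk))
print("smallest risk of the shrunk:", min(risk_shrunk))
```

The risk of the sample mean is smaller near the ends of the parameter interval but larger in the middle, whereas the second estimator has a constant risk. Neither dominates the other, yet the second one has the smaller maximum.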

(iii) The following gladiator paradox comes from K. S. Kaminsky,
E. M. Luks and P. I. Nelson. In a contest, called the gladiator game, suppose
that two teams of gladiators are to do battle in the arena. In successive
rounds a gladiator is selected from team $A=(A_1, A_2, \dots, A_m)$ to meet a
gladiator selected from team $B=(B_1, B_2, \dots, B_n)$. The victor returns
to his team with undiminished vigour to fight again, if needed. The loser,
presumably disabled, is removed from the tournament. Individual matches
are assumed to have a stochastic component and represent mutually
independent trials where we let $0<P(A_i, B_j)<1$ denote the probability
that gladiator $A_i$ defeats gladiator $B_j$. The matches continue
until one team is eliminated. We investigate the existence of strategies $S$
which are optimal in the sense of maximizing $P_S(A, B)$, the probability
that team $A$ defeats team $B$ when strategy $S$ is used. A strategy here is a
rule which decides the order in which gladiators from both teams enter
the arena. (Only the current composition of the teams can be used
in formulating the strategy at each stage of the game.) In a special
gladiator game let $a_1, a_2, \dots, a_m, b_1, b_2, \dots, b_n$ be positive
strengths assigned to $A_1, A_2, \dots, A_m, B_1, B_2, \dots, B_n$,
respectively, such that for all contests $A_i$ vs $B_j$,
$P(A_i, B_j)=a_i/(a_i+b_j)$. Then the probability $P_S(A, B)$ is the
same for all $S$! This is the gladiator paradox. Another paradox of this
game is the following. Say that $A$ dominates $B$ if $P(A, B)>1/2$. Now if
$A$ dominates $B$ and $B$ dominates $C$ then $A$ does not necessarily dominate
$C$. There are examples showing that
$m=\min\{P(A, B), P(B, C), P(C, A)\}>1/2$, though the upper limit of $m$ is an
intriguing open question. (A related paradox is I/13f.)
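
The first gladiator paradox can be checked on a small case by computing the winning probability exactly for several selection rules. The following is a minimal sketch under the stated model (the winning probability is strength over the sum of the two strengths); the team strengths and the three selection rules are hypothetical choices made only for illustration.

```python
from fractions import Fraction
from functools import lru_cache


def team_a_win_probability(team_a, team_b, pick_a, pick_b):
    """Exact probability that team A eliminates team B.

    pick_a and pick_b map the current team (a tuple of strengths) to the
    index of the gladiator who enters the arena next.
    """

    @lru_cache(maxsize=None)
    def win(a, b):
        if not b:
            return Fraction(1)
        if not a:
            return Fraction(0)
        i, j = pick_a(a), pick_b(b)
        p = Fraction(a[i], a[i] + b[j])        # P(A_i defeats B_j)
        return (p * win(a, b[:j] + b[j + 1:])
                + (1 - p) * win(a[:i] + a[i + 1:], b))

    return win(tuple(team_a), tuple(team_b))


def strongest(team):
    return team.index(max(team))


def weakest(team):
    return team.index(min(team))


def first(team):
    return 0


rules = (strongest, weakest, first)
results = {
    (ra.__name__, rb.__name__): team_a_win_probability((1, 2, 3), (2, 4), ra, rb)
    for ra in rules for rb in rules
}
for pair, prob in results.items():
    print(pair, prob)
```

All nine combinations of rules print the same fraction, in line with the statement that the probability does not depend on the strategy.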

(iv) A game theoretical paradox is the famous "prisoner's dilemma".
Here we only refer to the paper by Brams, Straffin, and Hofstadter.


13. QUICKIES

a) The paradox of "almost sure” events 


Consider two random events with probabilities of 99% and 99.99%,
respectively. One could say that the two probabilities are nearly the same,
both events are almost sure to occur. Nevertheless the difference may
become significant in certain cases. Consider, for instance, independent
events which may occur on any day of the year with probability p = 99%;
then the probability P that they will occur every day of the year is less
than 3%, while if p = 99.99% then P is about 96%.
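
The two figures follow from raising p to the 365th power, which is quick to check. A minimal sketch (the function name is illustrative):

```python
def prob_every_day(p, days=365):
    """Probability that independent events, each of probability p, all occur."""
    return p ** days


for p in (0.99, 0.9999):
    print(f"p = {p}: P = {prob_every_day(p):.4f}")
```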


b) The paradox of probability and relative frequency 


The following story, from George Polya, shows how not to interpret 
the frequency concept of probability. D. Tel (doctor of teleopathy) 
shook his head as he finished examining his patient. “You have a very 
serious disease," he said, “of ten people who have got this disease only 
one survives". As the patient was sufficiently scared by this information, 
D. Tel began to console him: “But you are very lucky sir, because you 
came to me. I have already had nine patients who all died of it, so you 
will survive." 


(Ref.: Polya, G., Patterns of Plausible Inference, Vol. II, Princeton Univ. Press, 
1954, p. 101.) 


c) Coin paradoxes 


(i) We toss a fair coin until we score two heads (HH) or a head and a tail 
(HT) in succession. Obviously the probability that (HH) will occur sooner 
than (HT) is equal to the probability that (HT) will occur sooner since 
after tossing a H the coin still falls Н or T with equal probability. In 
spite of this fact more tosses are necessary, on average, for (HH) than 
for (HT) to turn up. (HH) occurs in 6 tosses and (HT) in 4 tosses, on 
average. [Let $M_H$ denote the expected value of the number of tosses we
need to score (HH) assuming that H has occurred in the first toss, and
let $M_T$ denote the expected value of the number of tosses necessary to
score (HH) assuming that T has occurred in the first toss. Then
$M_H=1+(1+M_T)/2$ and $M_T=1+(M_H+M_T)/2$, and it follows that
$(M_H+M_T)/2=6$, i.e., the average number of tosses necessary to score (HH)
is indeed 6.] The contrast is sharper if we compare the sequences (HTHT)
and (THTT). The probability that (HTHT) will appear sooner than
(THTT) is 9/14, i.e., more than 1/2, but the average number of tosses we need
to score (HTHT) is still greater than that of the tosses necessary to get
(THTT). (The former is 20 whereas the latter is only 18.) Thus even if the
probability that the event A will occur sooner than B is larger than the
probability that B will occur sooner, we may still have to wait more,
on average, for A than for B.

Incidentally, it can be proved that among the H-T sequences of length
n the pure sequence (i.e., the one which consists either of n H's or of
n T's) has the longest expected waiting time. In this case the expected
number of tosses is $2^{n+1}-2$. The smallest possible (average) number of
tosses is $2^n$, which occurs when we want a sequence consisting of n-1 H's
in succession followed by one T, or n-1 T's followed by one H. (Thus we have
to wait almost twice as long for the head run of length n as for the
sequence of n-1 H's and one T, although the probability that the former
will appear sooner equals the probability that the latter will appear
sooner.)

Determining how long we have to wait for a given H-T sequence of length
n to appear usually requires cumbersome calculation (solving a
multivariate linear system of equations) for large n. Calculations can be
considerably simplified by using the "magic" Conway algorithm, which is
discussed in the article by Li quoted below. We shall give it now in a more
general form, not only for fair H-T sequences. Let $X, X_1, X_2, \dots$ be
independent, identically distributed random variables assuming only a
finite or countably infinite number of values with positive probabilities.
Denote the set of these values by $V$ and let $A=(a_1, a_2, \dots, a_n)$ and
$B=(b_1, b_2, \dots, b_n)$ be two (finite) sequences whose elements are from
$V$. Introduce the following notation:

$$d_{ij} = \begin{cases} 1/P(X=b_j) & \text{if } a_i=b_j, \\ 0 & \text{otherwise} \end{cases}$$

and $A\cdot B = d_{11}d_{22}\cdots d_{nn} + d_{21}d_{32}\cdots d_{n,n-1} + \dots + d_{n1}$.
Let $T_A$ and $T_B$ denote, respectively, the number of X-variables until the
first occurrence of the sequence $A$ and $B$ in $X_1, X_2, \dots$. Then the
expected value of $T_A$ is $A\cdot A$, while the probability that $T_A$ is
smaller than $T_B$ (assuming that neither $A$ nor $B$ contains the other as a
connected block) is

$$\frac{B\cdot B - B\cdot A}{A\cdot A + B\cdot B - A\cdot B - B\cdot A}.$$

For example, in a fair H-T sequence $d_{ij}=0$ or $d_{ij}=2$; if $A$=(HTHT)
and $B$=(THTT), then $A\cdot A=20$, $B\cdot B=18$, $A\cdot B=10$ and
$B\cdot A=0$, so the probability that $A$ occurs sooner than $B$ is
$18/28=9/14$, as we have mentioned above.
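
The Conway products are easy to evaluate mechanically. The sketch below reads the product as a sum over overlaps: for every k such that the last k elements of the first sequence coincide with the first k elements of the second, it adds the reciprocal of the probability of that common block. This reading agrees with the worked values for (HTHT) and (THTT); the function names and the use of strings for sequences are illustrative choices.

```python
from fractions import Fraction


def conway_product(a, b, prob):
    """A.B: sum of 1/P(block) over every overlap of the end of a with the
    start of b."""
    total = Fraction(0)
    for k in range(1, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            term = Fraction(1)
            for symbol in b[:k]:
                term /= prob[symbol]
            total += term
    return total


def expected_wait(a, prob):
    """Expected number of trials until the sequence a first appears."""
    return conway_product(a, a, prob)


def prob_first(a, b, prob):
    """Probability that a appears before b (neither containing the other)."""
    aa, bb = conway_product(a, a, prob), conway_product(b, b, prob)
    ab, ba = conway_product(a, b, prob), conway_product(b, a, prob)
    return (bb - ba) / (aa + bb - ab - ba)


FAIR = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

print(expected_wait("HTHT", FAIR), expected_wait("THTT", FAIR))
print(prob_first("HTHT", "THTT", FAIR))
print(expected_wait("HH", FAIR), expected_wait("HT", FAIR))
```

Passing a different dictionary of probabilities gives the corresponding values for a biased coin or for a die.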

Finally we state a more sophisticated (still unpublished) theorem. If
we wait until all the possible $2^n$ H-T sequences of length n occur (tossing
a fair coin) and $t_n$ denotes the (random) waiting time then

$$\lim_{n\to\infty} P(t_n/2^n - \log 2^n < x) = e^{-e^{-x}}.$$


(ii) A fair coin has to be tossed, on average, at least 8 times if we want 
a given sequence of length 3 (e.g., HHT). The number of necessary 
tosses is the smallest if we want to score any of the following sequences: 
(HHT), (THH), (TTH), (HTT). (In each case the average number of 
necessary throws is 8, whereas in any other case it is more than 8.) 
Compare these sequences in the following way:

α) the probability that (HHT) will occur sooner than (THH) is 1/4,
β) (THH) will appear sooner than (TTH) with probability 1/3,
γ) (TTH) will occur sooner than (HTT) with probability 1/4, and finally
δ) the probability that (HTT) will occur sooner than (HHT) is 1/3.


Thus, having started from the sequence (HHT), we have reached (HHT) 
again, though in each step the comparative probabilities were strictly 
less than 1/2. 
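
The four comparative probabilities can also be obtained without any special formula, by following the distribution of the last three tosses step by step and collecting the probability that is absorbed in the first pattern. This is only a numerical sketch (floating point, a fixed number of tosses after which the remaining probability is negligible); the function name is illustrative.

```python
from itertools import product


def first_to_appear(a, b, tosses=400):
    """Approximate probability that pattern a (length 3) shows up before
    pattern b (length 3) when a fair coin is tossed repeatedly."""
    # Distribution of the first three tosses.
    dist = {"".join(t): 1 / 8 for t in product("HT", repeat=3)}
    p_a = 0.0
    for _ in range(tosses):
        nxt = {}
        for state, pr in dist.items():
            if state == a:
                p_a += pr                  # a has just appeared: absorb
            elif state != b:               # b absorbs too, but is not counted
                for c in "HT":
                    key = state[1:] + c
                    nxt[key] = nxt.get(key, 0.0) + pr / 2
        dist = nxt
    return p_a


cycle = ["HHT", "THH", "TTH", "HTT", "HHT"]
for a, b in zip(cycle, cycle[1:]):
    print(f"P({a} before {b}) = {first_to_appear(a, b):.6f}")
```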


(Ref.: Li, Shou-Yen R., "A martingale approach to the study of occurrence
of sequence patterns in repeated experiments", Annals of Probability, 8,
1171-1176, (1980).)


d) The paradox of conditional probability


Events A, B and C exist such that

α) the conditional probability of A given B is smaller than the
conditional probability of A given that B has not occurred;

β) the conditional probability of A given that both B and C have occurred
is larger than the conditional probability of A given C has occurred
but B has not occurred;

γ) the conditional probability of A given B and the complement of C is
larger than that of A given that neither B nor C have occurred.

Using symbols the three statements can be written as follows:

$$P(A|B) < P(A|\bar B), \qquad P(A|BC) > P(A|\bar BC)$$
and
$$P(A|B\bar C) > P(A|\bar B\bar C).$$

This seems to be paradoxical because one might think that $P(A|B)$
is the average of $P(A|BC)$ and $P(A|B\bar C)$ and, similarly, that
$P(A|\bar B)$ is the average of $P(A|\bar BC)$ and $P(A|\bar B\bar C)$, and the
average of two smaller values must be smaller than the average of two
larger values. The explanation of this misconclusion is that $P(A|B)$ and
$P(A|\bar B)$ are the weighted averages of the above mentioned probabilities
but the respective weights are different in the two cases:

$$P(A|B) = P(C|B)\,P(A|BC) + P(\bar C|B)\,P(A|B\bar C),$$
whereas
$$P(A|\bar B) = P(C|\bar B)\,P(A|\bar BC) + P(\bar C|\bar B)\,P(A|\bar B\bar C).$$

Nevertheless if the events B and C are independent then
$P(C|B)=P(C|\bar B)$ and $P(\bar C|B)=P(\bar C|\bar B)$, so in this case the
paradoxical phenomenon cannot occur.
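
A small table of counts makes the role of the unequal weights visible. The numbers below are hypothetical, chosen only so that the three inequalities hold; the helper names are illustrative as well.

```python
from fractions import Fraction

# counts[(b, c)] = (outcomes in which A occurred, all outcomes), where b and c
# tell whether B and C occurred. Hypothetical figures for illustration only.
counts = {
    (True, True): (9, 10),
    (True, False): (30, 100),
    (False, True): (80, 100),
    (False, False): (2, 10),
}


def p_a_given(b, c=None):
    """P(A | B=b) if c is None, otherwise P(A | B=b, C=c)."""
    cells = [v for (bb, cc), v in counts.items()
             if bb == b and (c is None or cc == c)]
    return Fraction(sum(x for x, _ in cells), sum(n for _, n in cells))


def p_c_given(b, c):
    """P(C=c | B=b): the weight used in the weighted average."""
    total = sum(n for (bb, _), (_, n) in counts.items() if bb == b)
    return Fraction(counts[(b, c)][1], total)


print("P(A|B) =", p_a_given(True), " P(A|not B) =", p_a_given(False))
for c in (True, False):
    print("C =", c, ":", p_a_given(True, c), "vs", p_a_given(False, c))
print("weights given B:    ", p_c_given(True, True), p_c_given(True, False))
print("weights given not B:", p_c_given(False, True), p_c_given(False, False))
```

Given B almost all the weight falls on the case with the low conditional probabilities, given the complement of B it falls on the case with the high ones, which is exactly how the overall comparison gets reversed.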


(Ref.: Blyth. C. R. “Оп Simpson's paradox and the sure thing principle." J. 
Amer Statist. Assoc. 67. 364-366. (1972).) 


е) The paradox of random waiting times 


Two random events occur after (random) times X and Y. Paradoxically,
it may happen that X>Y with a probability of at least 99%, but X
is stochastically smaller than Y, i.e., the probability of X<t is larger
than the probability of Y<t for any fixed time t (or, in other words,
the distribution function of X is everywhere larger than that of Y). E.g.,
if Y is uniformly distributed in the interval [0, 1], X = Y+(1-Y)/1000
with probability 99% and X = Y/1000 with probability 1%. [This
paradoxical situation cannot occur if X and Y are independent: let F and
G denote their distribution functions; for simplicity assume that G is
continuous and its inverse function $G^{-1}$ exists. Then the distribution
function of the random variable $Z=G^{-1}(F(X))$ is also G. Since $F\ge G$,
$Z\ge X$, hence
$$P(X>Y) \le P(Z>Y) = \frac{1}{2}$$
as Z and Y are identically distributed independent random variables, i.e.,
P(X>Y) must be much smaller than 99%, in fact, not more than 50%].
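
For the example with the uniform Y the two distribution functions can be written down and compared on a grid. The closed form used for X below is a short derivation from the two cases of the definition (it is not printed in the text), and the function names are illustrative.

```python
def cdf_y(t):
    """P(Y < t) for Y uniform on [0, 1]."""
    return min(max(t, 0.0), 1.0)


def cdf_x(t):
    """P(X < t) for the X of the example.

    X = Y + (1 - Y)/1000 (probability 0.99):  X < t  iff  Y < (t - 0.001)/0.999
    X = Y/1000           (probability 0.01):  X < t  iff  Y < 1000 t
    """
    return 0.99 * cdf_y((t - 0.001) / 0.999) + 0.01 * cdf_y(1000 * t)


grid = [k / 1000 for k in range(1, 1000)]
print(all(cdf_x(t) > cdf_y(t) for t in grid))
print(min(cdf_x(t) - cdf_y(t) for t in grid))
```

So X lies stochastically below Y on the whole open interval, although X exceeds Y in the first case of the definition, which has probability 99%.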
The following paradox is similar to the preceding one. Let X and Y
be two independent random variables such that X is stochastically
smaller than Y. Then one might think that max (X, X+Y) is also
stochastically smaller than max (Y, X+Y), but that is not true, for example,
in the case where X and Y both may assume only the values -1, 0 and 1 with
1 1 


1 
probabilities У апа amo a respectively. 


(Ref.: Blyth, C. R., *Some probability paradoxes in choice from among random 
alternatives" (with comments by D. V. Lindley, I. J. Good, R. L. Winkler, and 
J. W. Pratt), J. Amer. Stat. Assoc., 67, 366—388, (1972). See also SIAM Rev., 
April 1970.) 


f) The paradox of transitivity 


Two players, A and B, are playing the following game. In the first step
A numbers 3 dice to his taste, writing one of the numbers 1, 2, 3, ..., 18
on each face of the 3 dice (he must use each number only once).

In the second step B scrutinizes the 3 dice (numbered by A) and
chooses one of them.

In the third step A chooses one of the remaining 2 dice. In the last
step both A and B throw their dice and the player who scores the larger
number wins.

One might think that this game is more favourable to B, because no
matter how A numbers the 3 dice B can always choose the best one (or
one of the best ones), thus B has a chance of at least 50% of winning.
But, paradoxically, just the opposite is true: A can number the dice so
that he wins with probability 21/36 (which is more than 50%), no matter
which dice B chooses. This is because of the "round defeat" numbering
system where each dice defeats exactly one of the other two, which means
there is no "best" among the dice. Let I, II and III denote the three
dice and suppose that A numbered the dice in the following way. He
wrote the numbers

    18, 10,  9,  8,  7,  5   on the faces of dice I,
    17, 16, 15,  4,  3,  2   on the faces of dice II,
and 14, 13, 12, 11,  6,  1   on the faces of dice III.


It can be easily calculated that we get a larger number with dice I than
with dice II with probability 21/36; similarly, a larger number is scored
with dice II than with dice III with probability 21/36, and the probability
that a larger number appears on dice III than on dice I is also 21/36,
therefore the "round defeat" probability is 21/36. So if A numbers the dice
this way, he is in a more favourable position than B. (If B chooses the
dice I, II or III and, accordingly, A chooses the dice III, I or II, A has
more chance of winning.) It can also be proved that the probability of
"round defeat" cannot exceed 21/36. The point of this paradox is that
random quantities may not be ordered according to which one is larger
than the other with a probability of more than 50%, because the
transitivity does not hold. If the same number may be written on more than
one face of the dice, e.g.,

    1, 4, 4, 4, 4, 4   on dice I,
    2, 2, 2, 5, 5, 5   on dice II,
and 3, 3, 3, 3, 3, 6   on dice III,

then the probability of "round defeat" is also 21/36. Formulate the
paradox more generally. Let $X_1, X_2, \dots, X_n$ be arbitrary numbers
depending on chance (i.e., random variables). Denote the probabilities of
the events $X_1<X_2$; $X_2<X_3$; ...; $X_n<X_1$ by $p_1, p_2, \dots, p_n$,
respectively, then $\min(p_1, p_2, \dots, p_n)$ is the probability of "round
defeat". Let $k_n$ denote this probability. The larger $k_n$ is the sharper
the paradox becomes; it can easily be shown that $k_n$ can never exceed
$(n-1)/n$ and this is the least upper bound. The calculation of the least
upper bound of $k_n$ is more difficult if the random variables
$X_1, X_2, \dots, X_n$ are supposed to be independent as the outcomes of
rolling dice. Let $f_n$ denote the least upper bound in this case. Usiskin
calculated (Annals of Math. Statist., 35, 857-862, 1964) that $f_2=1/2$,
$f_3=(\sqrt5-1)/2$ (the ratio of golden section), $f_4=2/3$, etc. The
sequence $\{f_n\}$ increases monotonically and converges to 3/4. One can also
show that the speed of convergence is of order $n^{-2}$.
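
The "round defeat" probabilities of a given numbering can be found by simply counting the 36 pairs of faces. In the sketch below the faces of the dice are an assumption: they are the numberings as reconstructed here from a damaged printing (the first set is pinned down by the requirement that each of 1, ..., 18 is used once; the second set is a well-known numbering consistent with the stated 21/36). The function and variable names are illustrative.

```python
from itertools import product


def wins(d1, d2):
    """Number of the 36 face pairs in which d1 shows the larger number."""
    return sum(x > y for x, y in product(d1, d2))


# Reconstructed numberings (assumptions of this sketch, see the lead-in).
SET_1 = {"I": (18, 10, 9, 8, 7, 5),
         "II": (17, 16, 15, 4, 3, 2),
         "III": (14, 13, 12, 11, 6, 1)}
SET_2 = {"I": (1, 4, 4, 4, 4, 4),
         "II": (2, 2, 2, 5, 5, 5),
         "III": (3, 3, 3, 3, 3, 6)}

# In the first set I beats II, II beats III and III beats I.
for a, b in (("I", "II"), ("II", "III"), ("III", "I")):
    print("set 1:", a, "beats", b, "in", wins(SET_1[a], SET_1[b]), "of 36 cases")

# In the second set the cycle runs the other way round.
for a, b in (("II", "I"), ("III", "II"), ("I", "III")):
    print("set 2:", a, "beats", b, "in", wins(SET_2[a], SET_2[b]), "of 36 cases")
```

In both sets the smallest of the three counts is 21, which gives the round defeat probability 21/36.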


g) The paradox of measurement for regularity of dice 


In dice throwing the same face will appear twice in succession in 7 throws,
on average, and three times in succession in 43 throws (see the end of
this paradox). If the dice were biased (i.e., different faces appeared with
different probabilities), the average number of necessary throws to get
the same face twice or three times in succession would be smaller. We
shall call dice I more regular than dice II if dice I has to be thrown more
times on the average to get the same face twice (or three times) in
succession than dice II. Paradoxically, more throws may be necessary on
average with dice I than with dice II to score the same face twice in
succession, but to score the same face three times in succession, dice II
has to be thrown more times.

The following simple example comes from T. F. Móri. Suppose that
each face of dice I can be scored with probabilities 0.03; 0.03; 0.19; 0.19;
0.28; 0.28 and let the corresponding probabilities with dice II be 0.04;
0.04; 0.17; 0.17; 0.29; 0.29. Then dice I has to be thrown 5.41 times and
dice II 5.47 times on the average to score a face twice in succession,
whereas if we want to score a face three times in succession then we have to
throw dice I 22.96 times and dice II 22.82 times on average (these two
values are obtained from the formula given below). This paradox
shows that it is not expedient to define the "regularity" of a dice
as we did. (In general it can be proved that if a particular face of a dice
appears with positive probability $p$, then the average number of necessary
throws to score this face k times in succession is
$m_k(p)=p^{-1}+p^{-2}+\dots+p^{-k}$. Consider a dice whose faces appear with
probabilities $p_1, p_2, \dots$ and let $M_k$ denote the average number of
throws we need to score the same face k times in succession. Then
$M_k^{-1}=m_k(p_1)^{-1}+m_k(p_2)^{-1}+\dots$. If we put
$p_1=p_2=\dots=p_6=1/6$ then $M_2=7$ and $M_3=43$ as we have already
mentioned.)
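
The closing formula, read as "the reciprocal of the overall waiting time is the sum of the reciprocals of the single-face waiting times", can be evaluated directly for the two dice of the example. This is a sketch under that reading; the function names are illustrative.

```python
def single_face_wait(p, k):
    """Average number of throws until a face of probability p appears
    k times in succession: 1/p + 1/p^2 + ... + 1/p^k."""
    return sum(p ** -j for j in range(1, k + 1))


def any_face_wait(probs, k):
    """Average number of throws until some face appears k times in
    succession, from the reciprocal-sum formula."""
    return 1 / sum(1 / single_face_wait(p, k) for p in probs if p > 0)


FAIR = [1 / 6] * 6
DICE_I = [0.03, 0.03, 0.19, 0.19, 0.28, 0.28]
DICE_II = [0.04, 0.04, 0.17, 0.17, 0.29, 0.29]

for name, dice in (("fair", FAIR), ("I", DICE_I), ("II", DICE_II)):
    print(name, round(any_face_wait(dice, 2), 2), round(any_face_wait(dice, 3), 2))
```

The order of dice I and dice II for two in succession is the opposite of their order for three in succession, which is the paradox of this section.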


h) The birthday paradox 


If not more than 365 people come together, it is possible that everybody
has a different birthday, while with 366 persons it is certain (100%) that
at least two of them were born on the same day of the year. (Let us
ignore the existence of leap years here.) However, if we aim at 99%
certainty, then, surprisingly, 57(!) people are enough to claim that there
will be two among them having the same birthday, while for 70 people
the probability that at least two of them have the same birthday exceeds
99.9%. It is almost unbelievable that such a small difference between the
probabilities 99% and 100% can lead to such a big difference between the
number of people. This paradoxical phenomenon is one of the main reasons
why probability theory is so wide-ranging in its application. (A similar
phenomenon was mentioned in I. 10 Remark (ii).)

Denote by $n$ the number of days in a year and by $x$ ($\le n$) the number
of people in a group. The probability that no two people in this group
have the same birthday is then

$$\frac{n(n-1)(n-2)\cdots(n-x+1)}{n^x}.$$

Therefore if

$$\frac{n(n-1)(n-2)\cdots(n-x+1)}{n^x} = 1-p,$$

then $p$ is the probability that among $x$ people there are some having
the same birthday. The approximate solution of this equation (provided
that $0<p<1$) is

$$x \approx \sqrt{2n\ln\frac{1}{1-p}}.$$

Hence the order of magnitude of $x$ is $\sqrt n$ for any value of $p$ in the
open interval (0, 1), while for $p=1$, $x=n+1$. A generalization of the
birthday problem is the following. Calculate the lower bound $x$ so that in a
group of $x$ people there be at least $k$ who have their birthday on the same
day of the year with probability $p$. Here the result is

$$x \approx c\,n^{(k-1)/k},$$

where $c$ is a constant depending only on $p$ and $k$ (more precisely
$c=\left(k!\ln\frac{1}{1-p}\right)^{1/k}$).

i) The paradox of heads and tails 


Suppose we are playing heads or tails with a fair coin and we toss it 100
times. Then, surprisingly, the probability of the event A = {we toss exactly
50 heads} is bigger than the probability of B = {we toss at least 60 heads}.
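
The comparison can be checked with exact binomial probabilities, which also shows how rough the normal approximation is at this sample size. A minimal sketch (the names are illustrative):

```python
from math import comb


def pmf(k, n=100):
    """Probability of exactly k heads in n tosses of a fair coin."""
    return comb(n, k) / 2 ** n


p_exactly_50 = pmf(50)
p_at_least_60 = sum(pmf(k) for k in range(60, 101))
p_at_least_55 = sum(pmf(k) for k in range(55, 101))

print(f"P(exactly 50)  = {p_exactly_50:.4f}")
print(f"P(at least 60) = {p_at_least_60:.4f}")
print(f"P(at least 55) = {p_at_least_55:.4f}")
```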

[As we have mentioned in I.10, P(A) ≈ 8%, while according to the
Moivre-Laplace theorem P(B) ≈ 1 - Φ(2) ≈ 3%. The chance of tossing
at least 55 heads, however, increases to about 16%, which is the double
of P(A).]


j) The edge of the coin


Generally the occurrence of a coin falling on its edge is left out of
consideration since this event almost never occurs. Calculate now the size
of a coin that ensures the same probability (1/3) for falling heads, tails,
and edge. For simplicity, consider the coin a flat cylinder whose bases are
the heads and tails and the nappe is the edge. If the coin is spun around an
axis which goes through the centre of the coin and is parallel to its bases
it is enough to consider a planar section of the coin which contains the
centre of the coin and is perpendicular to both bases. This section is a
rectangle.

Figure 7. When does a coin fall on its edge?

Draw a circle around this rectangle and choose a landing point at random
on its circumference. It is reasonable to suppose that the coin falls on its
edge with a probability which is equal to the chance that the radius
connecting the centre and the random point on the circumference intersects
the side of the rectangle which corresponds to the nappe of the coin. In
this model the coin falls on its edge with probability of 1/3 if the ratio
of its thickness and diameter is equal to tan 30° ≈ 0.577.
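
The planar model, and the spherical variant taken up in the sentences that follow, both reduce to elementary geometry. The two formulas in the sketch below are short derivations rather than statements from the text: in the plane the two edge sides together subtend the angle 4 arctan(t/d) at the centre, and on the sphere the zone cut out by the edge has an area proportional to its height. Here t/d denotes the ratio of thickness to diameter, and the function names are illustrative.

```python
from math import atan, pi, sqrt, tan


def edge_probability_planar(ratio):
    """Random point on the circle around the rectangular section.

    Each of the two edge sides subtends the angle 2*atan(ratio) at the centre.
    """
    return 4 * atan(ratio) / (2 * pi)


def edge_probability_spherical(ratio):
    """Random point on the sphere around the coin.

    The edge corresponds to a spherical zone of height t on a sphere of
    diameter sqrt(t^2 + d^2); zone area is proportional to height.
    """
    return ratio / sqrt(1 + ratio ** 2)


planar_ratio = tan(pi / 6)        # tan 30 degrees
spherical_ratio = 1 / sqrt(8)

print(round(planar_ratio, 3), edge_probability_planar(planar_ratio))
print(round(spherical_ratio, 3), edge_probability_spherical(spherical_ratio))
```

Both printed probabilities are 1/3 up to rounding, for the ratios 0.577 and 0.354 respectively.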
The problem is not reduced to a planar one if the coin may turn round
freely, more precisely, if the random point is chosen on the surface of
the sphere drawn around the coin and we suppose that the coin falls
on its edge if the radius connecting the random point and the centre
intersects the nappe of the coin. In this model, tossing an edge will have
the same probability as tossing heads or tails if the ratio of the thickness
and the diameter is 0.354.... There are, of course, more realistic models,
too. The most surprising of them is the one where the above ratio is the
least (i.e., where the coin is the flattest).


k) Borel's paradox


Let a random point be chosen uniformly on the surface of a sphere (e.g., 
on the Earth, supposing its form is a sphere). The position of a point is 
generally given by its longitude and latitude. Given a latitude, the longi- 
tude is uniformly distributed, but given a longitude the distribution of the 
latitude is not uniform. (Its density function is proportional to the cosine 
of the latitude.) Consequently, the distribution of the random point is 
not the same if we suppose that it is on the equator or on the Greenwich 
meridian, though both the equator and the meridian are great circles 
on the globe and therefore their role seems to be symmetric. 


Figure 8. Though both the equator 
and a meridian are great circles on 
the globe, when calculating con- 
ditional probabilities, one should 
take into account the fact that 
while the equator is surrounded 
by spherical zones, a meridian is 
surrounded by biangles. 


* 


The next problem is a similar paradox. Let X and Y denote two independent
standard normal random variables. (X, Y) can be considered a random point
on the plane. Let R and φ be its polar coordinates. Supposing that X = Y,
the distribution of $R^2=2X^2$ is the same as the distribution of the square
of a standard normal random variable multiplied by 2. At the same time,
under the condition φ = π/4 or φ = 5π/4 the distribution of $R^2=X^2+Y^2$
is the same as that of the sum of the squares of two independent standard
normal random variables (since R and φ are independent). Hence we
get completely different distributions for $R^2$ in the case X = Y and in
the case φ = π/4 or φ = 5π/4, which seems to be a paradox, because the
two conditions mean just the same, only in the first case it is formulated
by usual coordinates and in the second case by polar coordinates.


(Ref.: Billingsley, P., Probability and Measure, Wiley, New York—Chichester— 
Brisbane—Toronto, 1979). 




l) A paradox of conditional distributions


Let X and Y be random variables and f(x, y) a function of two variables
such that for any fixed y the variable f(X, y) is independent of Y. Is it
true that in this case f(X, Y) is also independent of Y? The following
simple example shows that the answer is negative. Let X = Y be uniformly
distributed on the interval (0, 1), and let f(x, y) = y if x = y and
f(x, y) = 0 otherwise. Now f(X, y) is identically 0 (with probability 1),
therefore it does not depend on Y, while f(X, Y) = Y obviously depends on Y.


(Ref.: Perlman, M. D., Wichura, M. J., "A note on substitution in conditional
distribution", Annals of Statist., 3, 1175-1179, (1975).)


m) Winning a losing game 


Suppose that in a game the number of trials (n) is always even. In each
trial the first player A has a chance p = 0.45 to get the point; for B the
chance is 0.55. To win the game, one of them has to collect more than half of
the points. If A has the privilege of fixing n then paradoxically n = 2 is
not the best choice. (This would be the best choice if p were very small,
i.e., less than 1/3.) If p = 0.45 and n = 2 then the probability that A wins
is only $0.45^2=0.2025$, but if A has more trials he gets in a more
favourable position. It is easy to prove that the optimal choice is n = 10.
This seems to contradict the general "principle" that the sooner we get out
of a losing game the better. Suppose, e.g., that we need 40 dollars
and have only 20. We want to win the missing sum by playing roulette.
Since roulette is a losing game, it is advisable to have as few tries as
possible, i.e., we have to stake all of our money, e.g., on red. In this
case the chance of winning is 18/38 (in American Roulette there are two
zeros: 0 and 00). At the same time if we bet only one dollar in every
trial we reach our aim with probability 0.11. For further details see
Dubins, L. E., Savage, L. J., How to Gamble if You Must, New York,
McGraw-Hill, 1965.
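
Both halves of this section can be checked numerically. The sketch below makes two assumptions that are not spelled out in the text: the cautious roulette player is treated by the standard gambler's ruin formula, and the quoted probability 0.11 is taken to refer to doubling 20 dollars to 40 with one-dollar stakes, the figures of Mosteller's problem (doubling 10 dollars to 20 in the same way would succeed with probability of about 0.26). The function names are illustrative.

```python
from math import comb


def win_probability(n, p=0.45):
    """Probability of collecting more than half of the points in n trials."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


def reach_target(start, target, p):
    """Gambler's ruin: probability of reaching `target` before 0 when staking
    one unit per trial, each trial being won with probability p != 1/2."""
    r = (1 - p) / p
    return (1 - r ** start) / (1 - r ** target)


best_n = max(range(2, 41, 2), key=win_probability)
print("best even n:", best_n, " winning probability:", round(win_probability(best_n), 4))
print("n = 2:", round(win_probability(2), 4))

p_red = 18 / 38
print("bold play:    ", round(p_red, 3))
print("cautious play:", round(reach_target(20, 40, p_red), 3))
```

Taking more trials helps in the first game because a win requires a strict majority, so that with few trials a tie is likely and counts against A; in roulette, by contrast, every additional bet only gives the house advantage more room to act.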


(Ref.: Mosteller, F., Fifty Challenging Problems in Probability with Solutions,
Reading, Addison-Wesley, 1965.)


n) The paradox of insurance


A client, whose property is $V$, wants to insure a part $bV$ of this property
($0<b<1$) against a damage occurring with probability $p$ each year.
The annual premium is $cV$ ($0<c<1$). The insurance is effected by
the company only if the expected value of its profit is positive, i.e.,
if $c$ is greater than $pb$. Why do clients still insure if they know that it
is profitable for the company and not for them? If the client insures and
pays the money for $n$ years, but the insurance company never has to pay,
then the client's initial property ($V$) will decrease to $V(1-c)^n$. And
what happens if he does not insure? Let $X_k$ denote the random variable
which is equal to 1 if the client suffers a loss in the $k$th year and let
$X_k=0$ otherwise. In this case his property in the $(k+1)$th year will be
$V_{k+1}=V_k(1-bX_{k+1})$ (with $V_0=V$), therefore after $n$ years

$$V_n = V\prod_{k=1}^{n}(1-bX_k).$$

Since the expected value of $\ln(1-bX_k)$ is $p\ln(1-b)$,

$$V_n \approx Ve^{np\ln(1-b)} = V(1-b)^{np}$$

with great probability. Thus the insurance is favourable for the client
if $V(1-b)^{np}$ is less than $V(1-c)^n$, i.e., (using the expansion in power
series) if $c$ is less than

$$pb + \frac{p(1-p)}{2}b^2 + \frac{p(1-p)(2-p)}{6}b^3 + \dots$$

It means that the insurance is favourable for both the client and the 
company if c is greater than pb but less than the above sum. It is easy to 
see that the less b is (i.e., the smaller part of the client's property is insured) 
the less is the freedom in the choice of c, i.e., the possibility of compromise 
decreases. (In a sense, lottery is an insurance too; if somebody tips always 
the same numbers but stops playing after a while and his numbers 
were drawn afterwards, he would probably die of apoplexy. From this 
point of view, the prize of the tickets is really favourable. Football 
pools are another case for there are very few people who always give 
the same tips and therefore it is not obvious what he missed when he 
did not play.) 
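The range of premiums acceptable to both sides can be made concrete numerically. The following sketch assumes the two bounds derived above, namely c > pb for the company and c < 1 - (1-b)^p for the client; the function names and the value p = 0.01 are assumptions chosen only for illustration.

```python
def premium_band(p, b):
    """Return (lower, upper) bounds on the premium rate c."""
    lower = p * b                  # company's expected profit is positive above this
    upper = 1 - (1 - b) ** p       # client's typical long-run wealth is higher below this
    return lower, upper

def series_upper(p, b):
    """First three terms of the power series of the upper bound."""
    return p * b + p * (1 - p) / 2 * b**2 + p * (1 - p) * (2 - p) / 6 * b**3

p = 0.01                           # hypothetical yearly probability of damage
for b in (0.1, 0.5, 0.9):
    lower, upper = premium_band(p, b)
    print(b, lower, upper, upper - lower)
```

The printed width of the band shrinks quickly as b decreases, in line with the remark that insuring a smaller part of the property leaves less room for compromise.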
o) Absurdities, Lewis Carroll


We will finish the series of quickies with absurdities and fallacies. We
will mention problems together with their nonsense solutions, but
finding the mistake in the reasoning may take some effort. Lewis
Carroll, the famous writer, was very fond of absurdities both in mathematics
and literature. (The Absurd Literature by Nicolae Balotă considers
Carroll the number one forerunner of modern absurdity.) In his last 10
years, Carroll was attracted by mathematical absurdities (see the collection
Curiosa Mathematica, 1888, or his article in Mind published
in April 1895). In his Pillow Problems (1894) the following absurdity
can be read.

There are two balls in a bag, they are either red or white. Let us guess
their colour without looking into the bag. According to Carroll, the
only correct answer is that one of them is red and the other is white. He
gave the following reasoning. If there were 2 red (R) and 1 white (W)
balls in the bag then the probability of drawing a red one would be 2/3.
On the other hand, if there were 3 balls in the bag, and the probability
of drawing a red one were 2/3, then there would be 2 R and 1 W balls
in the bag.

Now put an R ball into the bag that originally contained only two balls.
In this case there are four equally probable (1/4) ball combinations: RRR,
RWR, RRW and RWW. If the first combination is the actual one then
the probability of drawing an R ball is 1, in the second and the third cases
this probability is 2/3, and in the last one it is 1/3. Therefore the
probability of drawing an R ball is

$$\frac14\cdot 1 + \frac14\cdot\frac23 + \frac14\cdot\frac23 + \frac14\cdot\frac13 = \frac23.$$

Thus there must be 2 R and 1 W balls in the bag, consequently there must
have been 1 R and 1 W balls in the bag before we put an R ball into it.
This result is obviously nonsense, so the reasoning must be false. But
where is the mistake?

The following reasoning results in an absurdity as well. Two of three
prisoners, denoted by A, B, C, will be executed. They know this, too,
but cannot be sure who the lucky third will be. A says: “the probability
that only I will not be executed is 1/3. If I ask the warder to tell me one
of the 2 prisoners’ names (different from mine) who will be executed
then there remain only two possibilities. Either I am the other one to be
executed or not, and therefore my chance for survival will increase to
1/2.” However, it is also true that A knows even before the warder answers
that one of his companions will certainly be executed and therefore
the warder has not told him any essential information concerning his
own execution. Why then has the probability of his execution changed?

(The answer is very simple: the probability has not changed at all, it
has remained 1/3. The prisoner failed to take into account that the warder
says, e.g., B with the probability of 1/2 if B and C are going to be executed,
while if A and B are the victims, this probability is 1. Consequently, A's
actual chance for escaping execution equals the ratio of the probability
in the former case and that of the two cases together:

$$\frac{\frac12\cdot\frac13}{\frac12\cdot\frac13 + 1\cdot\frac13} = \frac13.$$

)
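The warder's answer can also be examined by listing every case with its probability. The sketch below is a minimal enumeration, assuming (as in the explanation above) that each pair of victims has probability 1/3 and that the warder chooses at random between B and C when both are to be executed; the table `joint` and the function `survival_given` are names invented for this example.

```python
from fractions import Fraction

half, third = Fraction(1, 2), Fraction(1, 3)

# (pair to be executed, name the warder gives to A) -> probability
joint = {
    (("B", "C"), "B"): third * half,   # A survives; warder picks B or C at random
    (("B", "C"), "C"): third * half,
    (("A", "B"), "B"): third,          # warder has no choice but to name B
    (("A", "C"), "C"): third,          # warder has no choice but to name C
}

def survival_given(named):
    """Conditional probability that A survives, given the name the warder said."""
    total = sum(pr for (pair, said), pr in joint.items() if said == named)
    survive = sum(pr for (pair, said), pr in joint.items()
                  if said == named and "A" not in pair)
    return survive / total

print(survival_given("B"), survival_given("C"))  # 1/3 1/3
```

Whichever name the warder gives, A's chance stays 1/3, while the chance of the prisoner who was not named rises to 2/3.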
Chapter 2


Paradoxes in mathematical statistics 


“Statistics is the physics of numbers.”
P. Diaconis

“Everything of importance has been said before by somebody who did not
discover it.”
A. N. Whitehead

“If one can tell ahead of time what one's research is going to be, the
research problem cannot be very deep and may be said to be almost nonexistent.”
A. Schild


Originally statistics was “state arithmetics". (The word statistics comes 
from the Latin status=state.) Since ancient times statistics have been 
applied to inform state leaders about the amount of taxes they can levy 
on their people and about the number of soldiers they can count on in 
war time. In China, a census was taken more than four thousand years 
ago. According to the Bible, Moses also counted all the men over 20 in 
his tribe. The result was 603,550. The fourth book of Moses (Book of 
Numbers) contains many other census data, but they seem to be exaggerated,
as are the data of Athenaios giving the number of slaves in the Greek
poleis at the time of the Roman Empire. It is rather unlikely that there
were 400,000 slaves in Athens and 460,000 in Corinth. We do not 
know how these census data have swollen, but it is a fact that, according 
to the census, the first city with more than a million inhabitants was 
Rome. England's first statistical document, the Domesday Book written 
in the 11th century, was also for purposes of taxation and the army. 
This is the reason why women have always been disregarded during a 
census right up until modern times. Statistics became a science only in 
the 17th century. Its pioneers were John Graunt (1620–1674) and Sir
William Petty (1623–1687). Graunt's “Natural and Political Observations
made upon the Bills of Mortality” (1662) was a demographic study.
In 1669 Huygens published a life table based on Graunt's data. Petty's 
“Treatise on Taxes” (1662) and “Observation upon the Dublin Bills of 
Mortality” (1681) also used Graunt’s results and ideas. In his “The Poli- 
tical Arithmetic” (published posthumously in 1689) Petty compared 
England, Holland, and France on their population, trade, and shipping. 
The term “political arithmetics” can be considered as the forerunner of
the word “statistics”. As capitalism advanced, not only state leaders 
but also capitalists became interested in statistical tables. More and more 
complicated mathematical means were used to process data, and their 
profit also increased, e.g., in the insurance business. Lloyd’s, one of the 
outstanding insurance companies in the world, was founded in the 
17th century, though at that time it was only a coffee-house in Tower 
Street in London. Good insurance is based on exact surveys and proper 
mathematical conclusions. Since the 17th century, mathematical statistics 
have gradually developed into an independent branch of mathematics. 
Its main purpose is to obtain as much correct and useful information 
as possible from the data, observations or measurements, in short from 
the statistical sample. (Measuring the amount of information apart from 
its concrete content developed into a new branch of mathematics only 
in the 20th century, and is now called information theory. It is very closely
related to mathematical statistics.) Not to write satire, at least in
Juvenalis’ opinion, is hard, but not to find paradoxes in mathematical
statistics is even harder. According to a joke, in 1901, 33% of the women
students of Harvard University married their tutors. Actually, at that 
time only 3 girls studied at the university, and one of them did marry her 
professor. The statement is true, though misleading. Suppose that in a 
certain country 20% more boys than girls are admitted to the universities. 
If all the candidates are equally qualified for entry and the number of 
boy candidates is the same as the number of girl candidates then the 
obvious conclusion is that admission committees give preference to boys. 
However, since more girls than boys want to study at the more popular 
faculties, where the refusal rate is higher, the result may be that despite 
proportional admittance, there will be more boys studying at university 
than girls. L. P. Ayres’ 1913 text analysis is similarly misleading or at
least it is easy to misinterpret. He states that the 50 most frequent words
make up about 50% of a typical text, the 300 most frequent words make
up 75%, while the 1000 most frequent words make up 90% of the text. 
In spite of this fact we should not conclude that if we know 50 or 100 
words of a language we already understand half of it, for the knowledge 
of articles, though they are frequently used, can hardly help in under- 
standing a text. No wonder many people believe there are three kinds 
of lies: white lies, damned lies, and statistics. We hope that the explana- 
tions of paradoxes in mathematical statistics will help us to see through 
statistical absurdities and to understand the useful and essential conclusions
of statistics as well as to find the most fundamental information.
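The university admission example can be illustrated with a small calculation. All numbers below are hypothetical and were chosen only so that both faculties treat boys and girls identically while the totals still differ by 20%; the faculty names and the helper `admitted` are assumptions of this sketch.

```python
# Hypothetical acceptance rates (in percent), identical for boys and girls.
acceptance_percent = {"popular": 15, "other": 40}

# Hypothetical numbers of applicants: 1000 boys and 1000 girls in total.
applicants = {
    "boys": {"popular": 400, "other": 600},
    "girls": {"popular": 600, "other": 400},
}

def admitted(group):
    """Total number admitted from one group, summed over the faculties."""
    return sum(applicants[group][faculty] * acceptance_percent[faculty] // 100
               for faculty in acceptance_percent)

boys, girls = admitted("boys"), admitted("girls")
print(boys, girls, boys / girls)  # 300 250 1.2
```

No committee prefers boys here; the difference arises only because more girls apply to the faculty with the higher refusal rate.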


1. BAYES' PARADOX 


a) The history of the paradox 


A student of de Moivre, Thomas Bayes, was one of the most outstanding 
pioneers of mathematical statistics. His theorem discovered about 1750 
but published only after his death was the root of several controversies 
in statistics. Even today, the heat of the debate has not decreased. More- 
over, the theoretical gulf between Bayesianism and anti-Bayesianism is 
widening. A simple formulation of Bayes' theorem is the following. Let
A and B be arbitrary events with probability P(A)>0 and P(B)>0,
resp.; denote by P(AB) the probability of the joint occurrence of A and
B and by P(A|B) the conditional probability of A if it is known that
B has already been observed. Then

$$P(B|A) = \frac{P(AB)}{P(A)}, \quad\text{i.e.}\quad P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}.$$

Therefore, if $B_1, B_2, \dots$ are disjoint events with positive probability and
one of them always occurs (or at least with probability 1) then

$$P(B_k|A) = \frac{P(A|B_k)\,P(B_k)}{P(A|B_1)\,P(B_1) + P(A|B_2)\,P(B_2) + \dots}.$$

This is Bayes' formula. It shows how the a priori probabilities $P(B_k)$
(the probabilities of $B_k$ before A was observed) determine the a posteriori
probabilities (after A was observed). If the events $B_k$ are considered the
reasons, then Bayes' formula is a theorem on the probability of reasons.
The theorem itself is indisputable but in most applications the probabilities
$P(B_k)$ are unknown. In this case, it is typical, though generally not
acceptable, to think that the absence of previous information on the
reasons $B_k$ implies the equality of the probabilities $P(B_k)$. Bayes applied
his theorem in cases when a priori probabilities were of continuous 
distribution, especially when they were of uniform distribution on the 
interval (0, 1). According to the Bayes' theorem, if an event of unknown 
probability p occurs n times out of n+m observations then the probability 
for p to belong to a subinterval (a, b) of the interval (0, 1) is 


$$\frac{\int_a^b x^n(1-x)^m\,dx}{\int_0^1 x^n(1-x)^m\,dx}.$$

Bayes set out the idea that if we do not have any previous information
about p then the a priori probability density of p is uniform on the whole
interval (0, 1). If, e.g., n=1, m=0, a=1/2 and b=1, according to the
above formula, the chance is 3/4 that the event in question has a probability
more than 1/2. Still, few people would bet on the basis of this
result partly because they doubt that the a priori distribution is uniform.
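The ratio of integrals above can be evaluated exactly for integer n and m by expanding $(1-x)^m$ with the binomial theorem. The sketch below does this with rational arithmetic; the function names are assumptions of this example, and the uniform a priori density is taken for granted as in the text.

```python
from fractions import Fraction
from math import comb

def integral_xn_1mx_m(n, m, a, b):
    """Exact integral of x^n (1-x)^m over (a, b), using the binomial expansion."""
    a, b = Fraction(a), Fraction(b)
    total = Fraction(0)
    for j in range(m + 1):
        power = n + j + 1
        total += comb(m, j) * (-1) ** j * (b**power - a**power) / power
    return total

def posterior_probability(n, m, a, b):
    """P(a < p < b) after n occurrences and m non-occurrences, uniform prior on (0, 1)."""
    return integral_xn_1mx_m(n, m, a, b) / integral_xn_1mx_m(n, m, 0, 1)

print(posterior_probability(1, 0, Fraction(1, 2), 1))  # 3/4
```

A single observed occurrence is thus already enough to give the odds 3 to 1 in favour of p > 1/2, which shows how strongly the conclusion leans on the assumed uniform prior.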

The lack of knowledge of a priori distributions had such a damag- 
ing effect on the statistical conclusions of Bayes’ theorem that it has 
been almost excluded from the main line of statistics. In the second 
third of the 20th century, however, Bayesian conclusions were revived 
partly because of their essential role in finding admissible and minimax 
estimators (see I.12. Remarks and Ferguson’s book). The opinion also 
gained ground that successive applications of the Bayes’ formula (after 
each observation the a posteriori probabilities are calculated and used as 
a priori probabilities next time) reduce the importance of the original 
a priori distribution since after many repetitions the original distribution 
can hardly influence the final a posteriori distribution. (Obviously, certain
degenerate cases are disregarded, e.g., when the value of p is 1/10,
and the a priori distribution is uniform on the interval
[1/2, 1] which does not cover the point 1/10.)
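The fading influence of the a priori distribution under successive applications of Bayes' formula can be observed in a small numerical experiment. The sketch below is only an illustration under stated assumptions: p is restricted to a grid of 99 values, the two priors (one flat, one proportional to p squared) and the repeating observation pattern with 30% successes are all hypothetical choices.

```python
grid = [i / 100 for i in range(1, 100)]            # candidate values of p

def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]

def update(prior, success):
    """One application of Bayes' formula for a single success/failure observation."""
    likelihood = [p if success else 1 - p for p in grid]
    return normalise([pr * lk for pr, lk in zip(prior, likelihood)])

def total_variation(d1, d2):
    return 0.5 * sum(abs(x - y) for x, y in zip(d1, d2))

flat = normalise([1.0] * len(grid))                 # uniform a priori distribution
skewed = normalise([p * p for p in grid])           # prior favouring large p

pattern = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]            # 30% successes, repeated
distances = {0: total_variation(flat, skewed)}
post_a, post_b = flat, skewed
for step in range(1, 1001):
    obs = pattern[(step - 1) % len(pattern)]
    # the a posteriori distribution of one step is the a priori one of the next
    post_a, post_b = update(post_a, obs), update(post_b, obs)
    if step in (10, 100, 1000):
        distances[step] = total_variation(post_a, post_b)
print(distances)
```

The distance between the two a posteriori distributions keeps shrinking as observations accumulate, so the initial disagreement between the two priors matters less and less.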


b) The paradox 


Let the possible values of a random variable X be the integers and suppose 
that the probability distribution of X depends on a parameter p belonging
to an interval [a, b]. If independent observations $X_1, X_2, X_3, \dots$ are
made on the unknown distribution of X (i.e., on the unknown parameter
p of the distribution; the $X_i$ have the same distribution as X) then one can
expect that the series of a posteriori distributions (calculated from the 
originally uniform a priori distribution) concentrates more and more 
on the true value of p. Paradoxically, this is not always true. The true 
value of p may be, e.g., 1/4 but the series of a posteriori distributions (as 
more observations are made) concentrates more and more, e.g., on 3/4. 


c) The explanation of the paradox 


The paradox seems to be surprising because the a posteriori density 
function is expected to be the highest in the neighbourhood of the true 
value, i.e., around 1/4. This idea, however, does not contradict the fact 
that the a posteriori density functions can concentrate more and more 
around 3/4. What should be achieved is only that the density function 
which is too high at 1/4 should very quickly decrease but remain high 
around 3/4. If the number of possible values of X is finite then this 
situation cannot be achieved, whereas if X can take any integer number 
then the paradoxical situation may really occur. Let the a priori distribution
of p be uniform over the interval [1/8, 7/8]. Now let us define a
function f(p) on this interval in such a way that f(p) is always a natural
number except if p=1/4 or p=3/4 where $f(1/4)=f(3/4)=+\infty$.
Let the distribution of the random variable X (depending on p) be the
following:

$$P(X=i) = c(1-p)p^i, \quad i = 0, 1, 2, \dots, f(p),$$

where $c=c_p$ is a constant for which

$$\sum_{i=0}^{f(p)} c(1-p)p^i = 1.$$

By a suitable choice of f(p), the above mentioned paradoxical situation
becomes achievable. For further details see Freedman's paper.


d) Remarks 


(i) S. Bernstein and R. von Mises had already pointed out before 1920 that,
under some conditions, when applying Bayes' theorem successively, the 
series of a posteriori distributions always converge to the actual distribu- 
tion whatever the a priori distribution was. That is why the a priori 
distribution has no significance asymptotically. According to the paradox, 
this conclusion cannot hold without any condition. 

(ii) The subjective selection of a priori distributions raises the general
question of whether unknown probabilities and probability distributions
are objectively determined at all, independently of our observations and
measurements, or whether they make sense only through our subjective information.
In his monograph Bruno de Finetti, the head of the Italian school
of probability theory, states that probability does not exist objectively,
just as absolute space and time, the cosmic ether, or phlogiston
do not exist either. “Objective probability” is nothing else than an attempt to
exteriorize and materialize our probabilistic beliefs. In his opinion an
event (e.g., tomorrow it will rain) either occurs or not (this is objective),
and on the basis of the information available we can figure out its “subjective”
probability. The personal or subjective probability indicates the
ratio of the bet we are willing to pay on the occurrence of the event. 
We can speak about subjective probability even if “randomness” is not 
objective. It has to be underlined, however, that the group of scientists 
claiming the existence of objective randomness and objective probability 
is much larger. Their conviction is the following: the objective probabili- 
ties of future events are encoded in the present state of the world. This 
kind of objective existence of probability has been expressed by the 
Nobel prize winner Max Born, who is famous for introducing objective 
probability into quantum physics. 
2. PARADOX ESTIMATORS OF THE EXPECTATION

a) The history of the paradox 


Equalization of contrasts and deviations in the “mean”, i.e., summariz- 
ing the observations into a single value has long traditions. Aeschylus 
writes in the Eumenides: “To moderation in every form God giveth the 
victory, but his other dispensation he directeth in varying wise..." 
and the followers of the Chinese philosopher Confucius said that “the 
immobility of the mean (=Chung Yung) is the greatest perfection". 
Mathematically, the notion of “mean” can be interpreted in many ways
(arithmetical mean, geometrical mean, median, etc.). In the practice
of statistics, however, arithmetical mean was extremely important for a 
very long time. The first outstanding results in probability theory and in 
mathematical statistics also explored and reinforced the importance of 
the arithmetical mean of statistical samples. 

Consider a set $(F_\vartheta)_{\vartheta\in\Theta}$ of probability distributions with finite expectation,
where the parameter $\vartheta$ is just the expected value of $F_\vartheta$. We want
to estimate the value of the unknown parameter $\vartheta$ on the basis of the
observed data (i.e., sample), $X_1, X_2, \dots, X_n$ (the sample elements $X_i$
are supposed to be independent, $F_\vartheta$ distributed random variables). The
arithmetical mean

$$\hat\vartheta = \bar X = \frac{X_1 + X_2 + \dots + X_n}{n}$$

as the estimator of $\vartheta$ has many good properties, e.g., it is always unbiased,
that is, $E(\hat\vartheta)=\vartheta$ for all $\vartheta\in\Theta$ (i.e., the estimate fluctuates around the
actual value). The laws of large numbers state that the estimator $\hat\vartheta = \bar X$
is consistent, i.e., for any $\varepsilon>0$ we have

$$\lim_{n\to\infty} P(|\hat\vartheta-\vartheta| < \varepsilon) = 1 \quad\text{for all } \vartheta\in\Theta,$$


so the error of the estimate can be made as small as desired by taking a 
sufficiently large sample. Nevertheless there may exist many unbiased, 
consistent estimators of a parameter and it is useful to give preference 
to estimators (among these) which have smaller variance. The paradoxes 
here reveal that (except in the case of normal distributions) the arithmetical
mean of the sample is not the minimum variance unbiased estimator
of the expected value. Moreover, if we do not insist on unbiasedness,
then, even in the case of multivariate normal distributions, it is not
always useful to estimate its expected value by the sample mean, because
this estimator is not admissible with respect to the quadratic loss function.
[For the definition of admissible estimation see I.12. Remark (ii).]
A similar paradox will be discussed in 13/q. 


b) The paradoxes 


(i) (Kagan–Linnik–Rao) Let F(x) be an arbitrary distribution function
with zero expectation and finite standard deviation and let
$F_\vartheta(x) = F(x-\vartheta)$ where the parameter $\vartheta$ is an arbitrary real value. If the
elements of the sample $X_1, X_2, \dots, X_n$ are random variables with distribution
$F_\vartheta$, then the sample mean $\bar X$ is a consistent and unbiased estimator
of the unknown parameter $\vartheta$ (which is obviously the expected value of
the distribution $F_\vartheta$). The estimator $\bar X$, however, is not very efficient
(except in the case of normal distribution): for any n>2, there exists an
unbiased estimator the standard deviation of which is smaller than that
of $\bar X$ (to be more precise, for all $\vartheta$ its standard deviation is at least as
small as that of $\bar X$ and for at least one $\vartheta$ it is definitely smaller).

(ii) (C. Stein) $\bar X$ is an exemplary “good” estimator of the expectation
of normal distributions: it is a minimum variance unbiased, consistent
estimator, admissible in respect of the quadratic loss function
$L(\vartheta, c) = (\vartheta-c)^2$, and also minimax. This is exactly why, some 20 years ago,
C. Stein's discovery, claiming that in the case of multivariate normal
distributions the estimator corresponding to $\bar X$ is not admissible, came
as a surprise. More specifically, consider probability distributions defined
on the k-dimensional Euclidean space the coordinates of which are
(for simplicity) independent normal distributions $N(\vartheta_i, \sigma)$, where the
standard deviation $\sigma$ is known. We seek an admissible estimator $\hat\vartheta$
of the vector $\vartheta = (\vartheta_1, \vartheta_2, \dots, \vartheta_k)$ whose quadratic loss

$$L(\vartheta, \hat\vartheta) = \|\vartheta-\hat\vartheta\|^2 = \sum_{i=1}^{k} (\vartheta_i-\hat\vartheta_i)^2$$

is, on the average, minimal. Then the vector $\hat\vartheta = \bar X$ (the k-dimensional
vector of the sample average) is admissible only in 1 and 2 dimensions
but not in higher dimensions (although the minimax property of $\bar X$
remains valid). Stein's recognition shows that even if we consider the
classical estimating problem (that is, estimating the expected value of a
normal distribution), $\bar X$ is not the only estimator we have to take into
account.


c) The explanation of the paradox 


(i) The interesting result of Kagan, Linnik, and Rao calls for proof
rather than explanation. Instead of reproducing the proof, here is a
method for finding asymptotically optimal estimators. First of all consider
the example of a uniform distribution function F(x) on the interval
(−c, c) (where c is an arbitrary positive number), and let
$F_\vartheta(x) = F(x-\vartheta)$; then $D^2(\bar X) = c^2/3n$. If $X_1^* = \min X_i$ and $X_n^* = \max X_i$,
i.e., $X_1^*$ is the smallest and $X_n^*$ the largest sample element (they are both
uniquely defined with probability 1 since the distribution is continuous)
then

$$D^2\left(\frac{X_1^* + X_n^*}{2}\right) = \frac{2c^2}{(n+1)(n+2)},$$

which is far smaller than $D^2(\bar X)$ for large n. Since

$$\frac{X_1^* + X_n^*}{2}$$

is also a simple unbiased and consistent estimator of $\vartheta$, it is preferable to
the “customary” $\bar X$. Turning to the general case let $X_1^* \le X_2^* \le \dots \le X_n^*$
be the ordered sample (i.e., $X_1^*$ is the smallest from $X_1, X_2, \dots, X_n$, etc.)
and

$$\hat\vartheta = \sum_{i=1}^{n} a_{in} X_i^*$$

where $a_{in}$ (i=1, 2, ..., n; n=1, 2, ...) are real numbers depending on F.
One can show that under some mild conditions the following choice of
$a_{in}$ leads to a minimum variance unbiased estimator of $\vartheta$ (at least asymptotically
as $n \to \infty$). Let a(x) be a real valued function on [0, 1] and
$a_{in} = a(i/n)/n$. If F is 3 times differentiable then the optimal choice of
a(x) is defined by

$$a(F(x)) = -\left[(\alpha + \beta x)\,(\log f(x))'\right]'$$

where

$$\alpha = \frac{\mu_2}{\mu_0\mu_2 - \mu_1^2}, \qquad \beta = -\frac{\mu_1}{\mu_0\mu_2 - \mu_1^2},$$

$$\mu_a = \int \frac{(f'(x))^2}{f(x)}\,x^a\,dx \quad (a = 0, 1) \qquad\text{and}\qquad \mu_2 = \int \frac{(f'(x))^2}{f(x)}\,x^2\,dx - 1$$

(the primes denote the derivatives, F’=f denotes the density function).
This formula for a(x) can be applied even if the expectation of F does
not exist and $\vartheta$ denotes the center of symmetry of f(x). E.g., let
$f(x) = \frac{1}{\pi(1+x^2)}$ (the Cauchy density; see the history of II/4). Then,
surprisingly enough, $a(x) = -4\cos 2\pi x\,\sin^2 \pi x$ is the optimal choice
which is negative (!) when x is close to 0 or 1. In this case the “customary”
$\bar X$ estimator is not even consistent. (For a detailed analysis of this topic
see T. F. Móri–G. J. Székely, “How to estimate location and scale parameters”,
Technical report, Eötvös L. Univ. 1986, see also H. Chernoff,
J. L. Gastwirth and M. V. Johns, Jr. “Asymptotic distribution of linear
combinations of order statistics with applications to estimation”,
Annals of Math. Statistics, 38, 52–72, (1967).)
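The advantage of the midrange over the sample mean for uniform distributions can be checked by simulation. The sketch below is a rough Monte Carlo experiment, not a proof; the sample size n = 20, the half-width c = 1, the true centre 5.0, the number of repetitions and the seed are all arbitrary assumptions. The simulated variances are printed next to the values $c^2/3n$ and $2c^2/((n+1)(n+2))$ quoted above.

```python
import random

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def simulate(n, c, theta, reps, seed=0):
    """Simulated variances of the sample mean and of the midrange."""
    rng = random.Random(seed)
    means, midranges = [], []
    for _ in range(reps):
        sample = [rng.uniform(theta - c, theta + c) for _ in range(n)]
        means.append(sum(sample) / n)
        midranges.append((min(sample) + max(sample)) / 2)
    return variance(means), variance(midranges)

n, c = 20, 1.0
var_mean, var_mid = simulate(n, c, theta=5.0, reps=20000)
print(var_mean, c**2 / (3 * n))                    # simulated vs. c^2/(3n)
print(var_mid, 2 * c**2 / ((n + 1) * (n + 2)))     # simulated vs. 2c^2/((n+1)(n+2))
```

Already for n = 20 the midrange has roughly a quarter of the variance of the mean, and the gap widens as n grows because one variance decreases like 1/n and the other like 1/n².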

(ii) After Stein's article, which was published in 1956, James and
Stein suggested the following simple estimator for the expectation of a
multivariate normal distribution in 1961:

$$X^* = \left(1 - \frac{(k-2)\sigma^2}{\|\bar X\|^2}\right)\bar X, \quad\text{where } k > 2$$

(here $\sigma^2$ stands for the variance of each coordinate of $\bar X$). Then, for
$\sigma = 1$, $E\|X^*-\vartheta\|^2 < k$, whereas $E\|\bar X-\vartheta\|^2 = k$, hence the estimator $\bar X$
is really not admissible. The estimator $X^*$ contracts the vector $\bar X$ towards
the origin of the coordinate system, and as the origin can be chosen
arbitrarily, the estimator

$$Q + \left(1 - \frac{(k-2)\sigma^2}{\|\bar X - Q\|^2}\right)(\bar X - Q)$$

is also better than $\bar X$ for any Q. Thus the James–Stein estimator depends
on how we choose the origin Q, whereas $\bar X$ is independent of Q. (It can
be shown that the estimator

$$X^{+} = \max\left(0,\; 1 - \frac{(k-2)\sigma^2}{\|\bar X\|^2}\right)\bar X$$

is even slightly better than $X^*$.)

Now we shall turn our attention to the heuristic explanation of why
the estimator $X^*$ is better than $\bar X$. Consider the samples of k independent
estimation problems together. The dispersion of the scalar sample elements
is due, partly, to the (common) standard deviation $\sigma$ of the k distributions
and partly to the (generally) unequal expectations $\vartheta_i$. Although these
unknown expectations may be quite different, the combined sample
may still show a dispersion which indicates that the values $\vartheta_i$ actually
do not differ considerably. For example, in the case where $\sigma=1$ and
about 16% of the observations are greater than 1, and 16% of them are
smaller than −1, it is reasonable to think that all the expectations $\vartheta_i$
are near to zero. In this case, if $X_i = 0.8$, the usual estimate of the ith
parameter is 0.8, whereas, according to the more “rational” conception
of the James–Stein estimator, the expectation $\vartheta_i$ is nearly zero. Though
this explanation may convince us of the “rationality” of the James–Stein
type estimations, it still seems extraordinary that, if we want to estimate,
for example, the expected values of the (normally distributed) body height,
the velocity of light and the price of a product, there can be any kind
of connection between these problems.
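The gain of the James–Stein estimator over the ordinary estimator can be observed in a simulation. The sketch below is an illustration under stated assumptions: one observation per coordinate with variance 1, dimension k = 10, and a hypothetical vector of true expectations with all coordinates equal to 0.5; the function names, the number of repetitions and the seed are likewise arbitrary.

```python
import random

def squared_error(estimate, theta):
    return sum((e - t) ** 2 for e, t in zip(estimate, theta))

def james_stein(x, sigma2=1.0):
    """Shrink the observed vector x towards the origin (requires len(x) > 2)."""
    k = len(x)
    norm2 = sum(v * v for v in x)
    shrink = 1 - (k - 2) * sigma2 / norm2
    return [shrink * v for v in x]

def average_losses(theta, reps=20000, seed=1):
    """Monte Carlo estimates of the quadratic risk of x itself and of its shrunken version."""
    rng = random.Random(seed)
    loss_x = loss_js = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]      # one normal observation per coordinate
        loss_x += squared_error(x, theta)
        loss_js += squared_error(james_stein(x), theta)
    return loss_x / reps, loss_js / reps

theta = [0.5] * 10                                  # hypothetical true expectations
risk_x, risk_js = average_losses(theta)
print(risk_x, risk_js)
```

The first number stays close to k = 10 whatever the true expectations are, while the second is markedly smaller when they lie near the chosen origin; far from the origin the gain becomes small but does not turn into a loss.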


d) Remarks 


(i) The following inequality due to Cramér and Rao gives useful information
concerning both paradoxes. Let $f(x, \vartheta)$ be the common density
function (depending on the parameter $\vartheta$) of the sample elements $X_i$,
i=1,2,...,n and let $B(\vartheta)$ denote the bias $E(\hat\vartheta_n-\vartheta)$ of the estimator
$\hat\vartheta_n$ of $\vartheta$. Then the Cramér–Rao inequality claims that

$$E(\hat\vartheta_n-\vartheta)^2 \ge B(\vartheta)^2 + \frac{(1+B'(\vartheta))^2}{n\,I(\vartheta)}$$

holds under certain regularity conditions, where the function $B'(\vartheta)$
is the derivative of $B(\vartheta)$ and

$$I(\vartheta) = E\left(-\frac{\partial^2 \ln f(X_i, \vartheta)}{\partial\vartheta^2}\right)$$

is the Fisher information. Thus $E(\hat\vartheta_n-\vartheta)^2$ cannot converge
to zero faster than at the rate 1/n. However, in the example of the first paradox
(uniform distribution, $B(\vartheta)=0$ and therefore $E(\hat\vartheta_n-\vartheta)^2 = D^2(\hat\vartheta_n)$) the
rate of convergence is $1/n^2$. This is not a contradiction because the example
in question is a typical case where the above mentioned regularity
conditions do not hold. (The following condition, for example, would
be a sufficient regularity condition: the set of the numbers x where $f(x, \vartheta)$
is positive does not depend on $\vartheta$.) Concerning our second paradox, the
Cramér–Rao inequality points out that allowing biased estimators, i.e.,
if we drop the condition $B(\vartheta)=0$, and if $B'(\vartheta)$ happens to be negative,
$E(\hat\vartheta_n-\vartheta)^2$ might decrease considerably compared to the variance of the
minimum variance unbiased estimator.

(iii) Let $X_1^* \le X_2^* \le \dots \le X_n^*$ denote an ordered sample and let X'
denote the sample median, i.e., $X' = X^*_{(n+1)/2}$ if n is odd and

$$X' = \frac{X^*_{n/2} + X^*_{n/2+1}}{2}$$

if n is even. If we take the sample from a normal distribution then

$$D^2(\bar X) \approx \frac{2}{\pi}\,D^2(X') \approx 0.64\,D^2(X'),$$

i.e., the efficiency of the estimator X' is (asymptotically) only 64% of that
of $\bar X$. The situation changes, however, if we slightly “perturb” the normal
distribution: consider a random variable which is the mixture of two
normal distributions, namely 91% is a normal distribution with expectation
$\vartheta$ and variance 1, and 9% is a normal distribution with the same
expectation $\vartheta$ and variance 9. In this case the median X' is a better
estimator of $\vartheta$ than $\bar X$.

(ii) The following paradox of admissible estimation is due to S. M.
Masani. Let $X_1, X_2$ be two independent random variables with expectations
$m_1, m_2$. Masani gives two examples (one binomial, one normal)
in which an estimator, depending only on $X_2$, is admissible for estimating
$m_1$. Here we mention only the binomial case. Let $X_1, X_2, \dots, X_m$
be independent binomial random variables with parameters $n_i, p_i$.
One can prove that a necessary and sufficient condition for the linear
estimator

$$\hat p_1(X_1, X_2, \dots, X_m) = \sum_{i=1}^{m} a_i X_i/n_i + c$$

to be admissible for $p_1$, when the loss function is the quadratic loss
$L(p_1, \hat p) = (p_1-\hat p)^2$, is that either

$$0 \le a_1 < 1, \quad 0 \le c \le 1$$

and

ПА 


0s Ўа+с=1, 0s Ха+с = 1 

і t=1 
or $a_1=1$ and $a_2=a_3=\dots=a_m=c=0$. If we put $a_1=0$ then we get a
large class of admissible estimators for $p_1$ not depending on $X_1$.
3. PARADOX ESTIMATORS OF THE VARIANCE


a) The history of the paradox 


Besides expected value, variance is the other most important characteristic
of random variables and their distributions. Suppose we want to estimate the
unknown variance $D^2$ of the random variable X from the sample $X_1, X_2, \dots, X_n$
(these are independent observations and have the same distribution as X).
If the expected value E is known then the estimator

$$\tilde D^2 = \frac1n\sum_{i=1}^{n} (X_i - E)^2$$

is unbiased. The situation changes if E is unknown and is replaced (in
the above formula) by its unbiased estimator $\bar X$. Then the estimator

$$\hat D^2 = \frac1n\sum_{i=1}^{n} (X_i - \bar X)^2$$

is no longer unbiased. Since unbiasedness has been one of the most
required good properties of estimators (since Gauss' time) the estimator
$\hat D^2$ was modified to make it unbiased. (Several parameters do not have
unbiased estimators at all. In these cases only asymptotic unbiasedness,
i.e.,

$$\lim_{n\to\infty} E(\hat\vartheta_n) = \vartheta$$

for all $\vartheta\in\Theta$ is required. This property holds for $\hat D^2$.) Besides unbiasedness
other important properties of good estimators were crystallized. A paradox
appears when different properties of good estimators do not lead
to the same estimator.


b) The paradox 


Multiplying $\hat D^2 = \frac1n\sum_{i=1}^{n} (X_i-\bar X)^2$ by the Bessel factor $n/(n-1)$ we get

$$\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2,$$

which is an unbiased estimator of the variance.
Suppose that X is normally distributed (with unknown expectation and
variance) and we prefer minimax estimators (see I.12. Remarks) with
loss function $L(D^2, d) = (d - D^2)^2/D^4$; then $\hat D^2$ must be modified in
just the other direction: it has to be multiplied by $n/(n+1)$ to obtain a
minimax estimator:

$$\frac{1}{n+1}\sum_{i=1}^{n} (X_i - \bar X)^2.$$

(Its risk is $2/(n+1)$.) Thus the minimax principle and unbiasedness led to
different estimators.


c) The explanation of the paradox 


The sum

$$\sum_{i=1}^{n} (X_i - a)^2$$

is minimal only if $a=\bar X$. However, the expected value E is generally
not equal to $\bar X$ (only near it), therefore $\frac1n\sum (X_i-E)^2$, showing the real
deviation, is greater than $\frac1n\sum (X_i-\bar X)^2$. That is why the Bessel correction is needed. On the
other hand there is no reason why minimax or admissible estimators 
should be unbiased. (We have already seen in II.2. that the James—Stein 
estimator of the expected value is better than the usual unbiased estima- 
tor X.) Since the unbiased and minimax estimators of the variance of 
normal distributions do not coincide, we have to decide with each practi- 
cal problem which one to choose. Fortunately, there is only a slight 
difference between the two estimators even at small values of n. (In other 
problems the difference, however, can be significant.) 
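The effect of the divisor can be computed exactly for normal samples. The sketch below relies on the standard fact that $\sum (X_i-\bar X)^2 / D^2$ has a chi-square distribution with n−1 degrees of freedom, hence mean n−1 and variance 2(n−1); the function name `relative_risk` and the choice n = 10 are assumptions of this example. It evaluates the expected loss $(d-D^2)^2/D^4$ for the divisors n−1, n and n+1.

```python
from fractions import Fraction

def relative_risk(n, divisor):
    """E[(S/divisor - D^2)^2] / D^4 for normal samples, S = sum of (X_i - mean)^2."""
    dof = n - 1                                   # S / D^2 is chi-square with n-1 degrees of freedom
    bias = Fraction(dof, divisor) - 1             # relative bias of S/divisor
    variance = Fraction(2 * dof, divisor**2)      # relative variance of S/divisor
    return variance + bias**2

n = 10
for d in (n - 1, n, n + 1):
    print(d, relative_risk(n, d))
```

For n = 10 this gives 2/9 for the unbiased divisor and 2/11 for the divisor n+1, so the two estimators differ only slightly, and n+1 is the divisor with the smallest risk among all multiples of the same sum of squares.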


d) Remarks 


Though surprising, it can be proved that the above mentioned minimax 
estimator is not admissible. (See Stein's article or Zacks' book.) On the 
other hand, if the expected value E of the normal distribution is known,
the estimator

$$\frac{1}{n+2}\sum_{i=1}^{n} (X_i - E)^2$$

is not only minimax (with a risk of $2/(n+2)$) but also admissible regarding
the mentioned loss function. (See Zacks' book.)
4. THE PARADOX OF LEAST SQUARES
a) The history of the paradox 


Due to the inevitable errors of observed measurements, theoretical for- 
mulas and empirical data frequently seem to be in contradiction. Legendre, 
Gauss, and Laplace elaborated an efficient method to diminish the effect 
of measurement errors early in the last century. (Legendre, for instance, 
worked out and applied it in 1805 to determine the orbits of comets.) 
The pioneers of this theory were Galileo (1632), Lambert (1760), Euler 
(1778), and others. The new procedure, called the method of least 
squares was discussed in detail by Gauss in his work “Theoria Motus” 
(1809). It was also Gauss who pointed out the probabilistic background 
of the method. (Though Legendre accused Gauss of plagiarism, Le- 
gendre could not properly substantiate his repeated accusations. Gauss 
claimed priority only in the use of the method and not in its publication.) 
Laplace published his fundamental book on probability theory in 1812 
which he dedicated to “Napoleon the Great". The entire fourth chapter 
is devoted to the calculus of error. Since then the method of least squares 
has developed into a new branch of mathematics. It is sometimes
"overmystified" and often used when other methods would be more expedient.
This problem was emphasized even by Cauchy (Comptes Rendus, 1853)
during his "debate" with Bienaymé (in the course of this dispute Cauchy
used the probability density function $1/(\pi(1+x^2))$, which was later
named after him, though he was not the first scientist to use the "Cauchy
density").


b) The paradox 


Let $a e^{-b|x-\mu|}$ be the density function of our observations subject to
random measurement errors; the constants a and b are known and $\mu$
has to be estimated. We make independent observations $X_1, X_2, \ldots, X_n$.
According to the method of least squares, $\mu$ has to be estimated by the
value $\hat\mu$ which minimizes the sum

$$(X_1-\hat\mu)^2+(X_2-\hat\mu)^2+\ldots+(X_n-\hat\mu)^2.$$

It is easy to calculate that this sum is minimal if $\hat\mu$ is the
arithmetical mean of the observed data:

$$\hat\mu=\frac{X_1+X_2+\ldots+X_n}{n}.$$

However, if we prefer the estimator $\tilde\mu$ for which the probability (more
precisely the probability density) that the results of n observations are
just $X_1, X_2, \ldots, X_n$ is maximal, i.e., if $\tilde\mu$ maximizes

$$a^n e^{-b\left(|X_1-\tilde\mu|+\ldots+|X_n-\tilde\mu|\right)}$$

or, equivalently, $\tilde\mu$ minimizes

$$|X_1-\tilde\mu|+|X_2-\tilde\mu|+\ldots+|X_n-\tilde\mu|,$$

then we get a contradiction since the sum of squares and the sum of
absolute values do not take their minimum at the same value of $\mu$,
i.e., $\hat\mu$ and $\tilde\mu$ are different. Which one is better?
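
The two minimizers can be compared on a small data set. The following is a minimal sketch with hypothetical observations (four close values and one gross error); it uses the fact that the sum of absolute deviations is minimized by the sample median.

```python
from statistics import mean, median


def sum_of_squares(data, m):
    """Sum of squared deviations of the data from m."""
    return sum((x - m) ** 2 for x in data)


def sum_of_abs(data, m):
    """Sum of absolute deviations of the data from m."""
    return sum(abs(x - m) for x in data)


observations = [1, 2, 3, 4, 100]       # hypothetical data
mu_hat = mean(observations)            # minimizes the sum of squares
mu_tilde = median(observations)        # minimizes the sum of absolute values

if __name__ == "__main__":
    print("mean", mu_hat, "median", mu_tilde)
    print(sum_of_squares(observations, mu_hat), sum_of_squares(observations, mu_tilde))
    print(sum_of_abs(observations, mu_hat), sum_of_abs(observations, mu_tilde))
```

Each estimate is the better one by its own criterion and the worse one by the other, which is the contradiction described above.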


c) The explanation of the paradox 


If the measurement errors were normally distributed (i.e., if their density
function were of the form $a e^{-b(x-\mu)^2}$), the above mentioned
contradiction would not appear since $\hat\mu$ maximizes

$$a^n e^{-b\left((X_1-\mu)^2+\ldots+(X_n-\mu)^2\right)}.$$

Gauss based the method of least squares on normally distributed errors,
and in practice this is the most frequent case. However, if the distribution
of errors is known to be different from normal, then using the least
squares estimator is not always advantageous. In the case of the above
mentioned paradox, the use of the estimator $\tilde\mu$ is more reasonable (see
also the previous section).

Using the customary notions of mathematical statistics, the paradox
can be formulated in brief as follows: the least squares estimator is not
always compatible with the maximum-likelihood estimator (for maximum
likelihood estimation see Section 8). In fact if f(x) is a positive density
function that is semi-continuous from below at x=0, and the density of the
observations is $f(x-\vartheta)$ and

$$\bar X=\frac{X_1+X_2+\ldots+X_n}{n}$$

is a maximum likelihood estimator of $\vartheta$ for n = 2, 3, then f(x) is the
density function of a normal distribution with zero expectation. This remark
is the Gaussian law of error, and it can be proved as follows: if for
simplicity the existence of the derivative f' is supposed and

$$\prod_{i=1}^{n} f(X_i-\vartheta)$$

is maximal for $\vartheta=\bar X$ then

$$\sum_{i=1}^{n}\frac{f'}{f}(X_i-\bar X)=0,$$

i.e., (with the notation $\Delta_i=X_i-\bar X$) $\sum_{i=1}^{n}\Delta_i=0$ implies

$$\sum_{i=1}^{n}\frac{f'}{f}(\Delta_i)=0,$$

and this can be valid for n = 2, 3 (when f'/f is measurable) only if
$f'(x)/f(x)$ is proportional to x, and it follows that $f(x)=d\,e^{-cx^2}$
where c and d are positive numbers (otherwise f would not be a density
function). So the least squares estimator of the location parameter can
coincide with its maximum likelihood estimator only for normal distributions.


d) Remark 


The arithmetic mean $\hat\mu=\bar X$ and the median $\tilde\mu$ are the only "simple"
maximum likelihood estimators of the parameter $\mu$ having the form

$$\sum_{i=1}^{n} a_i X_i^*,$$

where $X_1^*\le X_2^*\le\ldots\le X_n^*$ is the ordered sample and
$\sum_{i=1}^{n} a_i=1$ (n = 1, 2, ...).
5. CORRELATION PARADOXES
a) The history of the paradox 


By the last third of the previous century, several sciences (e.g., molec- 
ular physics) had reached such a level of development that the adaptation 
of probability theory and mathematical statistics became indispensable 
in these fields too. In 1859 Darwin's book revolutionized biology, and 
shortly afterwards his cousin Francis Galton established human genetics. 
(Mendel's study on genetics was only “rediscovered” at the turn of the 
century, and the word genetics has only been used since 1905; but Galton's
results had already aroused great interest in the last century.)
Galton and his students (especially Karl Pearson) introduced many im- 
portant notions such as correlation and regression which became funda- 
mental ideas of both probability theory and mathematical statistics (as 
well as of other related sciences). A man's weight and height are, natu- 
rally, in close connection though they do not determine each other 
uniquely. Correlation measures this connection by a single number, the
absolute value of which is not more than 1. The correlation of two random
variables X and Y is defined as follows. Let $E_x$ and $D_x$, $E_y$ and $D_y$
denote the expected value and the standard deviation of X and Y, resp.
Then the correlation coefficient (in brief: correlation) of X and Y is

$$r=r(X,Y)=\frac{E\left[(X-E_x)(Y-E_y)\right]}{D_x D_y}.$$

The absolute value of the correlation is maximal (i.e., equal to 1) if there
is a linear relationship between X and Y, i.e., Y=aX+b (where $a\ne 0$).
If X and Y are independent (and their variance is finite) then their
correlation is 0, i.e., they are uncorrelated. In mathematical statistics the
correlation r is usually estimated from the generally independent sample
$(X_1,Y_1), (X_2,Y_2), \ldots, (X_n,Y_n)$ by the following sample correlation
coefficient

$$\hat r=\frac{\sum_{i=1}^{n}(X_i-\bar X)(Y_i-\bar Y)}{\sqrt{\sum_{i=1}^{n}(X_i-\bar X)^2\,\sum_{i=1}^{n}(Y_i-\bar Y)^2}}.$$
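
The sample correlation coefficient can be computed directly from its definition. The sketch below is only an illustration; the function name and the data are hypothetical.

```python
from math import sqrt


def sample_correlation(xs, ys):
    """Sample correlation coefficient of the paired observations (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(sample_correlation(xs, [2 * x + 1 for x in xs]))   # exact linear relation
    print(sample_correlation(xs, [2.1, 3.9, 6.2, 7.8, 10.1]))
```

For an exactly linear relation with positive slope the value is 1 up to rounding, and with negative slope it is -1.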


In several cases r gives a good characterization of the relationship be- 
tween X and Y but even at the turn of the century several senseless correla- 
tions were calculated, e.g., the correlation of the number of stork nests 
and that of infants. Correlations have gradually been mystified and
several "internal", generally causal, connections were thought to exist
in the case of close (near to 1 in absolute value) correlation. This is why
totally absurd results were created which nearly succeeded in discrediting
statistics as a whole. It was generally ignored that close correlation of X
and Y might be caused by a third quantity. E.g., it was observed in
England and Wales that when the number of radio licences was increased
there was a corresponding increase in the number of insane and mentally
handicapped people. This interpretation is, however, completely false 
because listening to the radio does not bring about mental illness; it is 
simply that as time passes the number of radio listeners as well as the 
number of mental cases increase though there is no causal connection 
between them whatsoever. Unfortunately, misinterpretations are not 
always so simple to discover, e.g., in technical or economic applications.
The comparison of religion and height is another senseless correlation,
which claims that going from Scotland in the direction of Sicily
the rate of Roman-Catholics gradually increases while the average height 
of people decreases gradually; but of course any causal interpretation 
is absolute nonsense. (Even more farcical ideas were claimed to be 
causal relationships and even science by Fascist racial theory.) We will 
mention only some of the existing correlation paradoxes. 


b) The paradoxes 


(i) Let X be uniformly distributed over the interval (-1, 1) and Y = |X|.
Obviously, there is a very close relationship between X and Y, but their 
correlation r(X, Y)=0. (The correlation of X and Y=|X| is always 
0 when X is a random variable with finite variance and has a symmetric 
distribution around 0.) 

(ii) Let $X_1, X_2, \ldots, X_n$ be the temperature of a room at n different
moments and $Y_1, Y_2, \ldots, Y_n$ be the quantity of the fuel used up for
heating at the same moments (more precisely during a given period,
e.g., 1 hour before these moments). It is logical to think that the more
fuel used the warmer the room will be. This would mean that the correlation
of X and Y is strictly positive. In spite of this the correlation may be
negative, which can be interpreted as the more we heat the colder it will be.

(iii) Let (X, Y) be normally distributed, i.e., let the density function
of (X, Y) be

$$f(x,y)=\frac{1}{2\pi D_x D_y\sqrt{1-r^2}}\exp\left\{-\frac{1}{2(1-r^2)}\left[\frac{(x-E_x)^2}{D_x^2}-\frac{2r(x-E_x)(y-E_y)}{D_x D_y}+\frac{(y-E_y)^2}{D_y^2}\right]\right\},$$

where $E_x$, $D_x$, $E_y$ and $D_y$ are the expected values and standard
deviations of X and Y, and r is their correlation. Now we suppose that the
absolute value of the correlation is strictly less than 1. If r is unknown
we can estimate it by $\hat r$ from a sample of n elements. If $E_x$ and $E_y$
are known then it is advisable to modify the formula of $\hat r$ so that
$\bar X$ and $\bar Y$ are replaced by $E_x$ and $E_y$, resp. In this way we
obtain a new estimate $\tilde r$. As $\tilde r$ uses more information (namely
the knowledge of $E_x$ and $E_y$) we might think that its variance is less
than that of $\hat r$. However, A. Stuart calculated that

$$D^2(\hat r)\approx\frac{1}{n}(1-r^2)^2\quad\text{while}\quad D^2(\tilde r)=\frac{1}{n}(1+r^2),$$

consequently the latter is the bigger.


c) The explanation of the paradoxes 


(i) If X and Y are independent then r(X, Y)=0 but the converse assertion
is false. Uncorrelated variables may be strongly dependent as in the
above example where Y = |X|. Therefore "being uncorrelated" must
not be interpreted simply as being independent. On the other hand,
it can be proved that if X and Y are uncorrelated under the restrictions
$x_1<X<x_2$, $y_1<Y<y_2$, whatever the numbers $x_1<x_2$ and $y_1<y_2$ be,
then X and Y are independent.
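
The example Y = |X| can be checked exactly on a discrete analogue. The sketch below assumes, purely for illustration, that X takes the values -2, -1, 0, 1, 2 with equal probability (a symmetric distribution around 0, standing in for the uniform distribution on (-1, 1)).

```python
from fractions import Fraction

values = [-2, -1, 0, 1, 2]            # hypothetical symmetric distribution of X
p = Fraction(1, len(values))          # each value is equally probable


def expect(g):
    """Expected value of g(X)."""
    return sum(p * g(x) for x in values)


# Covariance of X and Y = |X|
covariance = expect(lambda x: x * abs(x)) - expect(lambda x: x) * expect(abs)

# Dependence: knowing X = 2 fixes Y = 2, although P(Y = 2) is below 1
prob_y_is_2 = expect(lambda x: 1 if abs(x) == 2 else 0)

if __name__ == "__main__":
    print("cov(X, |X|) =", covariance)
    print("P(Y = 2) =", prob_y_is_2, "but P(Y = 2 | X = 2) = 1")
```

The covariance vanishes because x|x| is an odd function, while Y is completely determined by X.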

(ii) We must not forget the disturbing effect of the temperature outside!

We often obtain completely unbelievable correlations because the expected
correlation coefficient of two random variables had been twisted
by a third, "exterior disturbing variable". It is precisely to avoid such
disturbing effects that the notion of partial correlation has been introduced.
If the correlation of X and Y is calculated only after having filtered
out the disturbing effect of the variable Z then the result will no longer
be a paradox. Let $r_{12}$, $r_{13}$ and $r_{23}$ denote the correlations
r(X, Y), r(X, Z) and r(Y, Z), resp. Then the partial correlation of X and Y
after having filtered out the effect of Z is

$$\frac{r_{12}-r_{13}r_{23}}{\sqrt{(1-r_{13}^2)(1-r_{23}^2)}}.$$

In the special case when $r_{13}=r_{23}=0$, the partial correlation of X and
Y is equal to the simple correlation $r_{12}$. If $r_{12}$, $r_{13}$, $r_{23}$
are not known then they can be estimated from the sample just like r. With
the help of these estimators, we shall obtain an estimator for the partial
correlation coefficient.
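
The heating example can be illustrated with the partial correlation formula. The correlation values below are hypothetical and chosen only for the illustration: X is the room temperature, Y the fuel used, and Z the temperature outside.

```python
from math import sqrt


def partial_correlation(r12, r13, r23):
    """Correlation of X and Y after filtering out Z, computed from the
    pairwise correlations r12 = r(X, Y), r13 = r(X, Z), r23 = r(Y, Z)."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


# Hypothetical values: warm weather outside means a warmer room and less fuel.
r_room_fuel = -0.4        # "the more we heat the colder it will be"
r_room_outside = 0.6
r_fuel_outside = -0.9

if __name__ == "__main__":
    print(partial_correlation(r_room_fuel, r_room_outside, r_fuel_outside))
```

With these assumed values the simple correlation of room temperature and fuel is negative, but the partial correlation is positive (about 0.40), as common sense expects.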


Figure 9. Considering the random variables X, Y, and Z as vectors, the correlation 
of random variables X and Y is the cosine of the angle between the vectors X 
and Y, and their partial correlation is the cosine of the angle between their images 
under projection onto the plane perpendicular to vector Z. 


(iii) Stuart's paradox can be shown from many aspects. The main
point is that $\hat r$ and $\tilde r$ are not unbiased estimators of r, i.e.,
the identities $E(\hat r)=r$ and $E(\tilde r)=r$ are not valid, and if so,
it is not advisable to consider an estimator better if its variance is less.
At the same time neither $\hat r$ nor $\tilde r$ is very biased (both are
asymptotically unbiased), therefore the explanation of the paradox needs
further analysis. (See the Remarks and Stuart's article.)


d) Remarks 


(i) The bias of the estimator $\hat r$ (in case of bivariate normal
distribution) is the following

$$E(\hat r-r)=\frac{r(r^2-1)}{2n}+o(n^{-1}),$$

where $o(n^{-1})$ denotes an expression which converges to 0 even if
multiplied by n. Thus the bias converges rather fast to 0 (as the sample size n
increases). On the other hand, it is interesting that $\arcsin\hat r$ is an
unbiased estimate of $\arcsin r$, and if $E(g(\hat r))=g(r)$ for some function
g independent of n then $g(r)=a\cdot\arcsin r+b$, where a, b are arbitrary
constants. In 1958 I. Olkin and J. W. Pratt proved: if the estimator of the
correlation coefficient r may directly depend on n then one can find an
unbiased estimator for r itself, namely


$$r^*=\hat r\,F\left(\tfrac12,\tfrac12;\tfrac{n-2}{2};1-\hat r^2\right),$$


where F is the hypergeometric function given by 
$$F(a,b;c;x)=\sum_{k=0}^{\infty}\frac{a(a+1)\cdots(a+k-1)\;b(b+1)\cdots(b+k-1)}{k!\;c(c+1)\cdots(c+k-1)}\,x^k,$$

where a, b, c ($c\ne 0,-1,-2,\ldots$) are parameters. Among unbiased
estimators it is already worth preferring those of minimal variance. It 
can be shown that r* is not only unbiased but of minimal variance, too. 
However, r* is rather complicated in practice, therefore it is advisable to
apply the following approximation of it:

$$\hat r\left[1+\frac{1-\hat r^2}{2(n-4)}\right].$$
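
The hypergeometric series F mentioned above is easy to evaluate numerically for |x| < 1 by summing its terms. The following is a minimal sketch (the function name and the number of terms are choices made here); it can be checked against the known closed form F(1, 1; 2; x) = -ln(1-x)/x.

```python
from math import log


def hypergeometric_f(a, b, c, x, terms=200):
    """Partial sum of the hypergeometric series F(a, b; c; x) for |x| < 1."""
    total = 0.0
    term = 1.0                       # the k = 0 term of the series
    for k in range(terms):
        total += term
        # ratio of the (k+1)-th term to the k-th term
        term *= (a + k) * (b + k) / ((k + 1) * (c + k)) * x
    return total


if __name__ == "__main__":
    x = 0.5
    print(hypergeometric_f(1, 1, 2, x), -log(1 - x) / x)
```

The fixed number of terms is adequate for moderate |x|; close to |x| = 1 the series converges slowly and more terms would be needed.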


(ii) It is not a paradox but it is nevertheless surprising that in choosing
m numbers randomly from 1, 2, ..., n (sampling without replacement,
i.e., the number of equally probable choices is $\binom{n}{m}$) the correlation
coefficient of the smallest and greatest number chosen is $1/m$, i.e., it is
independent of n (m = 1, 2, ..., n-1). More generally, if $X_1, X_2, \ldots, X_m$
denote the increasing sequence of the m values selected, then

$$r(X_i,X_j)=\sqrt{\frac{i(m+1-j)}{j(m+1-i)}}\qquad(i\le j),$$

which is independent of n, too.
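
For small n and m the statement can be verified by listing all equally probable choices. The sketch below works with the square of the correlation so that only exact fractions are needed; the function names are chosen here for the illustration.

```python
from fractions import Fraction
from itertools import combinations


def squared_correlation(n, m, i, j):
    """Exact r(X_i, X_j)^2 for the i-th and j-th smallest of m numbers drawn
    without replacement from 1..n (requires m < n so the variances are positive)."""
    samples = list(combinations(range(1, n + 1), m))   # each tuple is increasing
    w = Fraction(1, len(samples))
    mean_i = sum(w * s[i - 1] for s in samples)
    mean_j = sum(w * s[j - 1] for s in samples)
    cov = sum(w * s[i - 1] * s[j - 1] for s in samples) - mean_i * mean_j
    var_i = sum(w * s[i - 1] ** 2 for s in samples) - mean_i ** 2
    var_j = sum(w * s[j - 1] ** 2 for s in samples) - mean_j ** 2
    return cov * cov / (var_i * var_j)


def formula_squared(m, i, j):
    """Square of the closed-form correlation, i <= j."""
    return Fraction(i * (m + 1 - j), j * (m + 1 - i))


if __name__ == "__main__":
    for n in (4, 6, 8):
        print(n, squared_correlation(n, 3, 1, 2), formula_squared(3, 1, 2))
```

The enumerated value does not change with n in these cases, in agreement with the formula.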

[The following result is due to T. F. Móri. Take a sample of n elements
from 1, 2, ..., m+1 with replacement and denote by $Y_i$ the number of
sample elements the value of which is not greater than i. Then
$r(Y_i,Y_j)=r(X_i,X_j)$.] Another related problem is the following. If
$Z_1\le Z_2\le\ldots\le Z_m$ is an increasing sequence of independent uniformly
distributed random variables ("ordered sample") then
$r(Z_i,Z_j)=r(X_i,X_j)$; if the uniform distribution is replaced by any other
distribution for which $r(Z_i,Z_j)$ exists then $r(Z_i,Z_j)\le r(X_i,X_j)$,
i.e., this is an extremal property of uniform distributions (see the paper by
Móri-Székely). In fact we can prove more. Let the maximal correlation of two
random variables U and V be defined as the supremum of r(f(U), g(V)) where
f and g run over the set of square integrable real functions of U and V,
resp.:

$$\operatorname{max\,corr}(U,V)=\sup_{f,g}\,r(f(U),g(V)).$$

Using this notion, we can prove that

$$\operatorname{max\,corr}(Z_i,Z_j)=r(X_i,X_j).$$

This follows from the fact that max corr (U, V) = r(U, V) if both the
regression of U on V (for the definition of regression see the next paradox)
and the regression of V on U are linear (and not identically constant).
It is precisely this case when correlation is a good measure of closeness.
6. REGRESSION PARADOXES
a) The history of the paradoxes 


The correlation coefficient measures the dependence between two random 
variables by a single number, whereas regression expresses this depend- 
ence by a function-like relation and thus gives more precise information. 
For example, the average body weights as a function of body heights 
is a regression. The name "regression" comes from Galton, who, at 
the end of the past century, compared the heights of parents with that 
of their children. He found that the heights of children with tall (or 
small) parents are usually above (or below) the average but not as much 
as their parents' heights. The line which showed to what extent the
heights (and other properties as it later turned out) regressed (returned)
to the average through subsequent generations was called a regression
line by Galton. Later any function-like relation between random variables
was called regression. Regression analysis was applied first in
biology, and the most important scientific journal which dealt with this
topic was Biometrika, published since October 1901. In the years
between 1920 and 1930, its economic applications also became very important
and a new branch of science sprang up: econometrics (the term
is due to R. Frisch, 1926, who was later awarded the Nobel prize), with
its own journal, Econometrica, first published in 1933. From examining
special regression problems, researchers gradually came to the regression
analysis of the intrinsic structure of global economic systems (J. M.
Keynes, J. Tinbergen and others as, e.g., L. R. Klein, who won the Nobel
prize for economics in 1980). The journal Technometrics has been published
since 1959 and mainly deals with technical applications. The regression
analysis of a quantity X on another quantity Y, where X is difficult
to measure and Y can be measured quite easily, is very important.
Nowadays almost every branch of science applies regression analysis,
which is a good thing in itself, but unfortunately regression analysis has
also become one of the chief means of "facile scientific successes",
slipshod analyses and glossing over (scientific) problems. Regression
is never a substitute for scientific conceptions and theoretical background,
though it can help to find them.

Figure 10. Galton's regression line: the extent of "regression" (return) of
height through subsequent generations. Axes: height (inch) and distance
(inch). If the parents' mean is above the average height, the children
usually are shorter than their parents; if the parents' mean is below the
average height, the children usually are taller than their parents.


b) The paradoxes 


Suppose the dependence of two variables is given by a function of the
following type:

$$y=f(x;a_1,a_2,\ldots,a_m)\qquad(\text{e.g., } y=a_1x+a_2),$$

where only the parameters $a_1, a_2, \ldots, a_m$ are unknown (the type of
function, e.g., linear, quadratic, etc. is known). If we can measure the values
of y only with random observational errors, i.e., instead of
$y_i=f(x_i;a_1,a_2,\ldots,a_m)$ we observe the values $Y_i$ subject to errors,
then, according to the method of least squares, the unknown $a_j$'s are
estimated by the values which minimize the sum of squares

$$\sum_{i=1}^{n}\left(Y_i-f(x_i;a_1,a_2,\ldots,a_m)\right)^2.$$


(i) Accordingly, if $f(x)=e^{ax}$ then the estimator of a minimizes

$$\sum_{i=1}^{n}\left(Y_i-e^{ax_i}\right)^2.$$

In this case the problem of calculating the regression-curve f is usually
simplified by taking the logarithm of both terms of the difference in
brackets and minimizing the quantity

$$\sum_{i=1}^{n}\left(\ln Y_i-ax_i\right)^2,$$

which can be easily performed by finding the minimum of a quadratic
polynomial. However, the two minimizing methods give different estimators.
What is the solution of this paradoxical situation?
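
That the two methods give different estimates can be seen on a small example. The sketch below uses hypothetical data; the direct sum of squares is minimized by a simple ternary search, which assumes that the sum has a single minimum on the search interval (this holds for the data used here).

```python
from math import exp, log


def fit_log_linear(xs, ys):
    """Minimize sum (ln Y_i - a x_i)^2, a quadratic in a, in closed form."""
    return sum(x * log(y) for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def sse_direct(a, xs, ys):
    """Original sum of squares, sum (Y_i - e^(a x_i))^2."""
    return sum((y - exp(a * x)) ** 2 for x, y in zip(xs, ys))


def fit_direct(xs, ys, lo=0.0, hi=2.0, iterations=200):
    """Minimize sse_direct on [lo, hi] by ternary search (single minimum assumed)."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if sse_direct(m1, xs, ys) < sse_direct(m2, xs, ys):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


xs = [1.0, 2.0, 3.0, 4.0]     # hypothetical data
ys = [1.5, 3.0, 4.0, 8.0]
a_log = fit_log_linear(xs, ys)
a_direct = fit_direct(xs, ys)

if __name__ == "__main__":
    print("logarithmic method:", a_log)
    print("direct method:     ", a_direct)
```

The direct method gives more weight to the large observations, so for these data its estimate is somewhat larger than the one obtained after taking logarithms.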

(ii) Suppose we have only an alternative idea about the type of f, 
for example, $f_1$ is a polynomial and $f_2$ is an exponential function. It seems
natural to accept the type for which the above sum of squares is smaller 
(under optimal choice of the parameters). Though this principle is often 
followed in practice, usually it is not reasonable (sometimes even the 
theoretical possibility of this choice must be questioned). 

(iii) Let y=ax be the theoretical regression line and let $Y_i=ax_i+\varepsilon_i$,
where $\varepsilon_i$ (i = 1, 2, ..., n) are independent normally distributed random
errors with expectation 0 and variance $D^2(\varepsilon_i)=c^2y_i^2$, where
$y_i=ax_i$ (c is a known constant). Now suppose that the observations happen
to fit the regression line perfectly, i.e., $Y_i=a_0x_i$ for some $a_0$, thus

$$\sum_{i=1}^{n}(Y_i-a_0x_i)^2=0.$$

Then the least squares estimator of a is $a_0$, but, paradoxically, this is
not the "best" estimator (in the sense of maximum-likelihood; cf.
definition in paradox 8).
c) The explanation of the paradoxes


(i) Undoubtedly the method of least squares corresponds to the first 
method, nevertheless it is useful to consider not only the letter but also 
the spirit of this principle, since the qualitative meaning of least squares 
is the minimization of the total effect of errors. This purpose can also be
achieved by minimizing the sum of squares

$$\sum_{i=1}^{n}\left(h(Y_i)-h(f(x_i;a_1,a_2,\ldots,a_m))\right)^2,$$

where h(x) is a monotone increasing function (e.g., h(x)=ln x).
A good choice of h "linearizes", i.e., makes the formula
$h(f(x_i;a_1,a_2,\ldots,a_m))$ a linear function of the unknown parameters
$a_j$ (in this case the optimal values of the $a_j$'s can be easily
determined). If we want to determine
clearly better to choose the second method. However, the original sum 
of squares, for example, may have to be minimized if we know that the 
errors result in a financial loss which is proportional to this sum of 
squares, though this possibility is far from typical. 

(ii) The first part of the question is very simple: the sum of squares
may be smaller for $f_1$ than for $f_2$, but, taking some more sample elements
into consideration, the sum of squares becomes smaller if we choose $f_2$.
Mathematical statistics tries to avoid such unstable situations. There are
some decision methods available in certain cases, which decide with
given, e.g., 99% certainty (i.e., if $f_1$ is rejected then the probability that
$f_1$ is the right choice is 1%). In Plackett's book, for instance, a method is
discussed which enables the proper degree of regression polynomials to
be chosen (in case of independent, normally distributed observational
errors). Unfortunately, many typical alternative regression problems
cannot be properly handled. For example, the Weber-Fechner rule
states that there is a logarithmic relation between stimulus and sensation,
especially between volume and sound intensity, or between frequency
and pitch. Nowadays this rule is considered only an approximate one
both theoretically and experimentally, because a power-function relation
seems to be closer to the truth. (In fact, the problem is more complicated
as the sensation of loudness depends not only on intensity but also
on frequency and the spectrum of sound as well as on the duration of
the experiment.)

(iii) The estimator $\hat a=a_0$ is not satisfactory since the estimator of
$D^2(\varepsilon_i)$ would then be zero, which contradicts the condition
$D^2(\varepsilon_i)=c^2y_i^2$. The estimator

$$\frac{\sqrt{1+4c^2}-1}{2c^2}\,a_0$$

is more reasonable (maximum-likelihood).
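
The maximum-likelihood value can be derived in a few lines. The following is a sketch under the assumption that the error variance is $D^2(\varepsilon_i)=c^2a^2x_i^2$ (standard deviation proportional to $y_i=ax_i$) and that $a$, $a_0$ and the $x_i$ are positive; the auxiliary variable $t$ is introduced here only for the calculation.

```latex
% Log-likelihood of a when the observations are Y_i = a_0 x_i
\ln L(a) = \text{const} - \sum_{i=1}^{n}\ln(c\,a\,x_i)
           - \sum_{i=1}^{n}\frac{(a_0x_i-a x_i)^2}{2c^2a^2x_i^2}
         = \text{const}' - n\ln a - \frac{n\,(a_0-a)^2}{2c^2a^2}.

% Substitute t = a_0/a (so that \ln a = \ln a_0 - \ln t) and maximize in t:
\frac{d}{dt}\left[\ln t-\frac{(t-1)^2}{2c^2}\right]
  = \frac{1}{t}-\frac{t-1}{c^2}=0
  \;\Longrightarrow\; t^2-t-c^2=0
  \;\Longrightarrow\; t=\frac{1+\sqrt{1+4c^2}}{2}.

% Hence the maximum-likelihood estimator is
\hat a=\frac{a_0}{t}=\frac{2a_0}{1+\sqrt{1+4c^2}}
      =\frac{\sqrt{1+4c^2}-1}{2c^2}\,a_0 .
```

For small c the factor is close to 1, so the correction to the least squares value $a_0$ is slight when the relative error is small.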


d) Remarks 


(i) The logit-probit alternative is also very typical, especially in
pharmacology and market research. In logit analysis, the function

$$y=\frac{e^{a_1+a_2x}}{1+e^{a_1+a_2x}}$$

is fitted to the data by the method of least squares, minimizing

$$\sum_{i=1}^{n}\left(\ln\frac{Y_i}{1-Y_i}-a_1-a_2x_i\right)^2.$$

[Here the transformation function which linearizes the problem is
$h(x)=\ln\frac{x}{1-x}$.] In probit analysis a normal distribution function
is fitted to the data (by an appropriate choice of the parameters). The
shapes of the two types of curves may be quite similar, so it is not always
easy to decide which one to choose, but the theoretical background may be
a great help.

(ii) If we increase the number of regression parameters then obviously 
we obtain a better fitting but then the variances of parameter estimators 
increase, so the estimators become less stable and less reliable. 

(iii) For the "paradox of the two regressions" see Kalman (1982).
In this paper (following the pioneering work by Gini (1921) and Frisch
(1934)) it is assumed that there are random (additive) errors in both
variables: $X=x+\tilde x$ and $Y=y+\tilde y$ ($\tilde x$, $\tilde y$ are the
errors or "noise"). Supposing y=ax, the "unprejudiced estimate" of a can be
given only in terms of an interval: $a_1\le a\le a_2$. Here one of the limits
is the classical regression coefficient (when y is regressed on x), and the
other limit is the reciprocal regression coefficient (when x is regressed on
y). The choice of either limit $a_1$ or $a_2$ implies the prejudice that the
regressor variable is noise-free. (This gives a solution for the "paradox of
the two regressions".)
7. PARADOXES OF SUFFICIENCY


a) The history of the paradox 


Sufficiency is one of the most important concepts in mathematical
statistics. Its use was introduced by R. A. Fisher in the 1920s. He started
from the idea that, for statistical inference concerning unknown parameters,
we do not always need to know all the sample elements one by one. It
is enough to know some functions of the sample called sufficient statistics.
E.g., in the case of a one-dimensional normal distribution all the
information concerning its expected value is contained in the arithmetical
mean $\bar X$ of the sample elements $X_1, X_2, \ldots, X_n$. This follows from
the fact that the distribution of $(X_1-\bar X, X_2-\bar X, \ldots, X_n-\bar X)$
is independent of the unknown expected value and if so, no further
information can be obtained about it from the random variables
$X_1-\bar X, X_2-\bar X, \ldots, X_n-\bar X$.
The mathematical definition of sufficiency is as follows. The functions
$T_1=T_1(X_1,X_2,\ldots,X_n)$, $T_2=T_2(X_1,X_2,\ldots,X_n)$, ...,
$T_k=T_k(X_1,X_2,\ldots,X_n)$
are called sufficient statistics for the parameter $\vartheta$ of the common
distribution of the $X_i$ if the joint distribution of $X_1, X_2, \ldots, X_n$
with given $T_1, T_2, \ldots, T_k$ is independent of $\vartheta$. Returning to
the above example, the joint conditional density function of the independent
variables $X_1, X_2, \ldots, X_n$ given $\bar X=\bar x$ is

$$\frac{1}{\sqrt{n}\,(\sqrt{2\pi}\,\sigma_0)^{n-1}}\exp\left(-\frac{1}{2\sigma_0^2}\sum_{i=1}^{n}(x_i-\bar x)^2\right)$$

(where $\sigma_0$ denotes the standard deviation of $X_i$) and this density does
not depend on $\vartheta$.


b) The paradox 


It was Fisher who pointed out the following paradox of sufficiency in
1934. He studied a two-dimensional normal distribution whose coordinates
were independent (for simplicity) with variance 1. Only their expected
values were unknown. The arithmetical mean $\bar X=(\bar X_1,\bar X_2)$ of the
two-dimensional sample is a sufficient statistic for the unknown pair of
expected values $(\vartheta_1,\vartheta_2)$. Suppose that the distance between
the expected value (considered a vector) and the origin, i.e.,
$\sqrt{\vartheta_1^2+\vartheta_2^2}$ is known, say 3. Then

$$(\vartheta_1,\vartheta_2)=3(\cos\vartheta,\sin\vartheta),$$

where $\vartheta$ is the only unknown parameter. It can be estimated by

$$\hat\vartheta=\operatorname{arctg}\frac{\bar X_2}{\bar X_1}.$$

This is an unbiased estimator: $E(\hat\vartheta)=\vartheta$ and its variance is
$E(\hat\vartheta-\vartheta)^2=0.12$. It is easy to prove that the distribution
of $r=\sqrt{\bar X_1^2+\bar X_2^2}$ is independent of $\vartheta$ (since the
distribution of $(\bar X_1,\bar X_2)$ has a rotational symmetry
around $(\vartheta_1,\vartheta_2)$) therefore, due to sufficiency, no information
concerning $\vartheta$ can be gained by taking r into account. This is, however,
anything but true. The expected value of $(\hat\vartheta-\vartheta)^2$ (i.e., the
efficiency of the estimator) is strongly influenced by the knowledge of r.
E.g., $E((\hat\vartheta-\vartheta)^2\mid r=1.5)=0.26$,
$E((\hat\vartheta-\vartheta)^2\mid r=3)=0.12$, and
$E((\hat\vartheta-\vartheta)^2\mid r=4.5)=0.08$.
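
The dependence of the squared error on r can be seen in a small simulation. The sketch below rests on assumptions that are not in the text: the mean vector is taken to have unit variance in each coordinate (as for a single observation), the angle is computed with the two-argument arctangent so that all quadrants are handled, the true angle 0.3 is arbitrary, and instead of conditioning on exact values of r the samples are split into a group with small r and a group with large r.

```python
import math
import random


def simulate(theta=0.3, radius=3.0, trials=40000, seed=1):
    """Return the mean squared angle error for r < 2.25, for all samples,
    and for r > 3.75 (hypothetical thresholds around the known radius 3)."""
    rng = random.Random(seed)
    low, every, high = [], [], []
    for _ in range(trials):
        x1 = rng.gauss(radius * math.cos(theta), 1.0)
        x2 = rng.gauss(radius * math.sin(theta), 1.0)
        r = math.hypot(x1, x2)
        err = math.atan2(x2, x1) - theta
        err = (err + math.pi) % (2 * math.pi) - math.pi   # wrap into [-pi, pi)
        sq = err * err
        every.append(sq)
        if r < 2.25:
            low.append(sq)
        elif r > 3.75:
            high.append(sq)

    def mean(values):
        return sum(values) / len(values)

    return mean(low), mean(every), mean(high)


if __name__ == "__main__":
    print(simulate())
```

In such a run the mean squared error is clearly larger when r is small than when r is large, in line with the conditional values quoted above.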


c) The explanation of the paradox 


Fisher's paradox points out that "having all information" can be
interpreted in different ways. In calculating the efficiency of estimations,
the ancillary statistics (like r) may have an important role. Unfortunately,
it is not always easy to decide what to take for ancillary statistic.
Obviously, taking the whole sample as ancillary statistic is not worthwhile.
If Fisher's problem is considered from a Bayesian point of view and $\vartheta$
is supposed to be uniformly distributed on the interval $(-\pi,\pi)$ then

$$E((\hat\vartheta-\vartheta)^2\mid\bar X_1,\bar X_2)=E((\hat\vartheta-\vartheta)^2\mid r).$$


d) Remarks 


The modern theory of sufficiency is due to P. R. Halmos and L. J.
Savage (1949). Several interesting paradoxes were brought up in this
field too. E.g., Burkholder (see the References) presented some pathological
examples showing that if we add some more information to a sufficient
statistic then sufficiency may be lost. Such examples totally
contradict our general view of sufficiency. In the past decade several
deep papers were published in this field introducing some “regularity 
conditions"; these ensure the non-paradoxical (non-pathological) 
behaviour of sufficient statistics. 




e) References 


Burkholder, D. L., “Sufficiency in the undominated case", Annals of Math. Statist., 
32, 1191—1200, (1961). 

Burkholder, D. L., “On the order structure of the set of sufficient σ-fields", Annals
of Math. Statist., 33, 596—599, (1962). 

Efron, B., “Controversies in the foundations of statistics", The American Math. 
Monthly, 85, 231—246, (1978). 

Fisher, R. A., “On the mathematical foundations of theoretical statistics", Phil. Trans. 
Roy. Soc. Ser. A, 222, 309—368, (1922). 


8. PARADOXES OF THE MAXIMUM-LIKELIHOOD METHOD 
a) The history of the paradoxes 


One of the most efficient methods of estimating unknown parameters is 
the maximum-likelihood estimation. It gained ground in the twenties 
through the work of the English statistician R. А. Fisher. Though Fisher 
had predecessors, it was his article that made the decisive breakthrough 
in 1912. In order to elucidate the method, suppose, for simplicity, 
that the density function of the probability distribution (depending on 
the unknown parameter $\vartheta$) exists and denote it by $f_\vartheta(u)$. If
the sample elements $X_1, X_2, \ldots, X_n$ are independent, their joint density
function is

$$\prod_{i=1}^{n} f_\vartheta(u_i).$$

Let the numbers $x_1, x_2, \ldots, x_n$ be the observed values of the sample.
Then $\hat\vartheta$ is the maximum-likelihood estimator of $\vartheta$, if
$\hat\vartheta$ maximizes

$$\prod_{i=1}^{n} f_\vartheta(x_i)$$

as the function of $\vartheta$ (supposing the maximum exists and is unique). In
the case of discrete random variables $X_i$, we maximize the joint
probability $P_\vartheta(X_1=x_1, X_2=x_2, \ldots, X_n=x_n)$. If we estimate
$\vartheta$ by the method of maximum likelihood then the probability (or
probability density) that we observe $x_1, x_2, \ldots, x_n$ becomes maximal.
The maximum-likelihood estimator has several good properties; this is why it
is a widespread method. If, for example, $\hat\vartheta$ is the
maximum-likelihood estimator of $\vartheta$, then $g(\hat\vartheta)$ is the
maximum-likelihood estimator of $g(\vartheta)$. It can also be
proved that under quite general conditions the maximum-likelihood
estimator $\hat\vartheta$ behaves asymptotically as a normally distributed
variable with mean $\vartheta$ and variance $\frac{1}{nI(\vartheta)}$ (see
Paradox 2. Remark (i)); thus $\hat\vartheta$ is consistent and its asymptotic
variance is minimal (i.e., it is asymptotically efficient). Moreover if a
sufficient estimator exists (cf. "Paradoxes of Sufficiency"), then the
maximum-likelihood method gives a function of this sufficient estimator.


b) The paradoxes 


(i) Let $X_1, X_2, \ldots, X_n$ be independent, uniformly distributed random
variables on the interval $(\vartheta, 2\vartheta)$. The maximum-likelihood
estimator of the unknown parameter $\vartheta$ is $\max X_i/2$. Its slight
modification

$$\hat\vartheta = \frac{2n+2}{2n+1}\cdot\frac{\max X_i}{2}$$

is an unbiased estimator of $\vartheta$ with variance asymptotically
$\vartheta^2/(4n^2)$. On the other hand, the variance of the estimator

$$\frac{n+1}{5n+4}\,(\min X_i + 2\max X_i)$$

is asymptotically $\vartheta^2/(5n^2)$, hence this estimator is more efficient
than the maximum-likelihood estimator, whose asymptotic efficiency is maximal.

(ii) A very simple example can be found to illustrate that a maximum-likelihood
estimator is not always consistent. Let $A$ be the set of rational
numbers between 0 and 1, and $B$ a set of countably many irrational numbers
between 0 and 1. Suppose the independent sample elements $X_1, X_2, \ldots, X_n$
take only the values 0 and 1, and they take the value 1 with probability
$\vartheta$ if $\vartheta$ is an element of $A$, and with probability
$1-\vartheta$ if $\vartheta$ is an element of $B$. Then the maximum-likelihood
estimator of $\vartheta$ is not consistent. (Though there exists a somewhat more
complicated consistent estimator of $\vartheta$.)
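
The variance comparison in paradox (i) can be checked by simulation. The sketch below is only a Monte Carlo illustration: the sample size, the number of repetitions and the seed are arbitrary, and the normalizing factor $(n+1)/(5n+4)$ of the second estimator is assumed here because it makes that estimator unbiased.

```python
import random
import statistics

def simulate(theta=1.0, n=20, reps=20_000, seed=0):
    """Compare two unbiased estimators of theta for samples from U(theta, 2*theta).

    Returns (mean, variance) of the corrected maximum-likelihood estimator
    followed by (mean, variance) of the estimator built from min and max.
    """
    rng = random.Random(seed)
    ml_based, combined = [], []
    for _ in range(reps):
        xs = [rng.uniform(theta, 2 * theta) for _ in range(n)]
        lo, hi = min(xs), max(xs)
        ml_based.append((2 * n + 2) / (2 * n + 1) * hi / 2)
        combined.append((n + 1) / (5 * n + 4) * (lo + 2 * hi))
    return (statistics.mean(ml_based), statistics.variance(ml_based),
            statistics.mean(combined), statistics.variance(combined))

ml_mean, ml_var, comb_mean, comb_var = simulate()
```

With these settings both sample means should be close to $\vartheta$, while the variance of the combined estimator should come out visibly smaller than that of the corrected maximum-likelihood estimator.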
c) The explanation of the paradoxes


(i) The statistics $\xi=\min X_i$ and $\eta=\max X_i$ together contain all the
information concerning $\vartheta$; more precisely, given $\xi$ and $\eta$ the
joint density function of $X_1, X_2, \ldots, X_n$ does not depend on $\vartheta$
(i.e., $\xi$ and $\eta$ together are sufficient). Thus it is natural that both
the maximum-likelihood estimator and the one which turned out to be a better
estimator depend only on $\xi$ and $\eta$. Since the maximum-likelihood
estimator depends only on $\eta$, which is not sufficient by itself (it does not
contain all the information concerning $\vartheta$), it is not very surprising
that we could find a better estimator. This does not contradict the asymptotic
efficiency of maximum-likelihood estimators since, in the case of uniform
distributions, the "general conditions" that assure this efficiency do not hold.

(ii) The explanation is quite simple: the maximum-likelihood estimator
of $\vartheta$ is the relative frequency

$$\frac{1}{n}\sum_{i=1}^{n} X_i$$

and it tends to $1-\vartheta$ if $\vartheta$ is irrational.

Though this problem is somewhat pathological, it is at least easy to
understand. (The paper of D. Basu gives a consistent estimator of $\vartheta$.)
There exist other examples of non-consistent maximum-likelihood estimators
which are less artificial, but more complicated (cf. the papers
by Neyman and Scott, Kiefer and Wolfowitz, and Ferguson).


d) Remarks 


(i) There are numerous "maximum-likelihood" estimators in the statistical
literature where no real maxima were found (just saddle-points)
or only one of the local maxima was considered.* Though the frequent
appearance of these examples is rather interesting, they cannot be considered
paradoxes, only "oversights", even if they are published in first-rate
journals by the best mathematicians.


* One of the simplest and most important examples where the local maximum is
not unique is the normal distribution with unknown expectation $\vartheta$ and
variance proportional to $\vartheta^2$.

(ii) Examples of J. L. Hodges and others raised the paradoxical problem
of superefficiency. Here we only refer to the dissertation by L. Le Cam
and the paper by H. Chernoff. (An estimator of $\vartheta$ is superefficient if
its distribution is asymptotically normal with mean $\vartheta$ and variance not
more than the asymptotically minimal $1/(nI(\vartheta))$ and strictly less for
at least one value of $\vartheta$.)


e) References 


Bahadur, R. R., "Examples of inconsistencies of maximum likelihood estimates",
Sankhya, 20, 207-210, (1958).

Barnett, V. D., "Evaluation of the maximum likelihood estimator when the likelihood
equation has multiple roots", Biometrika, 53, 151-165, (1966).

Basu, D., "An inconsistency of the method of maximum likelihood", Annals of Math.
Statist., 26, 144-145, (1955).

Berkson, J., "Minimum chi-square, not maximum likelihood!", Annals of Statist.,
8, 457-487, (1980).

Boyles, R. A., Marshall, A. W., Proschan, F., "Inconsistency of a distribution having
increasing failure rate average", Annals of Statist., 13, 413-417, (1985).

Chernoff, H., "Large sample theory: parametric case", Annals of Math. Statist., 27,
1-22, (1956).

Edwards, A. W. F., "The history of likelihood", Internat. Statist. Rev., 42, 9-15,
(1974).

Ferguson, T. S., "An inconsistent maximum likelihood estimate", J. Amer. Statist.
Assoc., 77, 831-834, (1982).

Fisher, R. A., "On an absolute criterion for fitting frequency curves", Messenger
of Mathematics, 41, 155-160, (1912).

Fisher, R. A., "On mathematical foundations of theoretical statistics", Phil. Trans.
Roy. Soc. (London) Ser. A, 222, 309-368, (1922).

Kale, B. K., "Inadmissibility of the maximum likelihood estimation in the presence
of prior information", Canad. Math. Bull., 13, 391-393, (1970).

Kiefer, J., Wolfowitz, J., "Consistency of the maximum likelihood estimation in the
presence of infinitely many incidental parameters", Annals of Math. Statist., 27,
887-906, (1956).

Konijn, H. S., "Note on the nonexistence of a maximum likelihood estimation",
Aust. J. Statist., 5, 143-146, (1963).

Kraft, C., Le Cam, L., "A remark on the roots of the likelihood equation", Ann. Math.
Statist., 27, 1174-1177, (1956).

Le Cam, L., "On some asymptotic properties of maximum likelihood estimates and
related Bayes' estimates", Univ. California Publ. Stat., 1, 277-330, (1953).

Neyman, J., Scott, E. L., "Consistent estimates based on partially consistent
observations", Econometrica, 16, 1-32, (1948).

Norden, N. H., "A survey of maximum likelihood estimation", Intern. Statist. Rev., 40,
329-354, (1972).

Rao, C. R., "Apparent anomalies and irregularities in maximum likelihood
estimation", Sankhya, 24, 73-102, (1962).

Reeds, J. A., "Asymptotic number of roots of Cauchy location equations", Annals of
Statist., 13, 775-784, (1985).


9. THE PARADOX OF INTERVAL ESTIMATIONS 
a) The history of the paradox 


The theory of interval estimation was developed basically by R. A. Fisher
and J. Neyman between 1925 and 1935. Neyman's confidence interval
contains the unknown parameter $\vartheta$ with a prescribed probability $\alpha$.
Let $X_1, X_2, \ldots, X_n$ denote the sequence of sample elements and let
$A=A(X_1, X_2, \ldots, X_n, \alpha)$ and $B=B(X_1, X_2, \ldots, X_n, \alpha)$ be
such that $P(A<\vartheta<B)=\alpha$. Then $(A, B)$ is called the
$\alpha$-confidence interval for $\vartheta$. If $\vartheta$ denotes the unknown
expectation of a normal distribution with standard deviation $\sigma$, then

$$P(\bar X - 2\sigma/\sqrt{n} < \vartheta < \bar X + 2\sigma/\sqrt{n}) \approx 0.95,$$

i.e.,

$$(\bar X - 2\sigma/\sqrt{n},\ \bar X + 2\sigma/\sqrt{n})$$

is a 95% confidence interval for $\vartheta$. Another type of interval estimation
considers not the sample but the unknown parameter $\vartheta$ as a random
variable. In this case the interval $(A, B)$ does not depend on chance, and

$$P(A<\vartheta<B)=\alpha$$

means simply that $\vartheta$ falls into the interval $(A, B)$ with probability
$\alpha$. E.g., if $\vartheta$ denotes the unknown expected value of a normal
distribution, then $\vartheta$ is not determined completely by the sample mean
$\bar X$; due to random errors in measurement, this $\vartheta$ can be considered
a normally distributed random variable with expected value $\bar X$ and standard
deviation $\sigma/\sqrt{n}$. Hence

$$P(\bar X - 2\sigma/\sqrt{n} < \vartheta < \bar X + 2\sigma/\sqrt{n}) \approx 0.95.$$

This kind of interval estimate, called a fiducial interval, was introduced
by Fisher. As we can see, in the case of the normal distribution, confidence and
fiducial intervals of expected values are formally the same; only their
"philosophy" is different. It was believed for a while that these two
intervals were practically the same and the debate of confidence versus
fiducial seemed to be only theoretical. (At first it was Neyman who
supported Fisher's fiducial theory the most, mainly because Fisher also
avoided applying Bayes' theorem.) Paradoxes of practical importance
appeared, however, rather soon. The different philosophies of Fisher and
Neyman led to different results in practical applications as well. In 1959
C. Stein pointed out an extremely paradoxical case. For simplicity, he
considered confidence and fiducial intervals for which $B=\infty$ or $A=-\infty$,
because these kinds of intervals are determined by a single value (the
other end point of the interval).
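
The frequency interpretation of the confidence interval for a normal mean can be illustrated by simulation. In the sketch below the values of $\vartheta$, $\sigma$, $n$, the number of repetitions and the seed are arbitrary assumptions; the function simply counts how often the random interval $(\bar X - 2\sigma/\sqrt{n},\ \bar X + 2\sigma/\sqrt{n})$ covers the fixed parameter.

```python
import math
import random

def coverage(theta=3.0, sigma=2.0, n=25, reps=10_000, seed=1):
    """Fraction of simulated samples whose interval
    (mean - 2*sigma/sqrt(n), mean + 2*sigma/sqrt(n)) contains theta."""
    rng = random.Random(seed)
    half_width = 2 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        mean = sum(rng.gauss(theta, sigma) for _ in range(n)) / n
        if mean - half_width < theta < mean + half_width:
            hits += 1
    return hits / reps

observed_coverage = coverage()
```

The observed fraction should be close to 0.95. In this experiment the parameter is held fixed and only the sample varies, which is the confidence reading rather than the fiducial one.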


b) The paradox 


Let $X_1, X_2, \ldots, X_k$ be independent, normally distributed random variables
with unit variance ($k\ge 2$) and denote by $\vartheta_1, \vartheta_2, \ldots,
\vartheta_k$ their unknown expected values. Let the distance of the vector
$\vartheta=(\vartheta_1, \vartheta_2, \ldots, \vartheta_k)$ from the origin be

$$|\vartheta| = \sqrt{\vartheta_1^2+\vartheta_2^2+\ldots+\vartheta_k^2}.$$

Stein proved that the confidence and fiducial intervals of $|\vartheta|$ may
differ extremely, which results in the following paradox. Let us estimate every
$\vartheta_i$ by the mean value $\bar X_i$ of an n-sized sample. Let the distance
between the origin and the sample mean vector
$\bar X=(\bar X_1, \bar X_2, \ldots, \bar X_k)$ be
$|\bar X| = \sqrt{\bar X_1^2+\bar X_2^2+\ldots+\bar X_k^2}$. Then
$P(|\bar X|>|\vartheta|)>0.5$ if $\bar X$ is the random variable (confidence
interval), whatever the value of the unknown $\vartheta$ is. On the other hand,
if $\vartheta$ is the random variable (fiducial interval), then
$P(|\vartheta|>|\bar X|)>0.5$ for any value of the sample mean $\bar X$. In
other words, the probability that the confidence interval $(-\infty, |\bar X|)$
contains the unknown $|\vartheta|$ is more than 50%, while there is also a more
than 50% chance that the random $|\vartheta|$ is contained in the (fiducial)
interval $(|\bar X|, +\infty)$. Thus, in the confidence approach, it is
favourable to bet on the inequality $|\bar X|>|\vartheta|$, while with the
fiducial approach it is just the other way round.
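
The confidence half of the paradox is easy to check numerically. The following Monte Carlo sketch uses arbitrary, illustrative parameter vectors, repetition counts and seed; it estimates $P(|\bar X|>|\vartheta|)$ for a fixed $\vartheta$ by repeatedly drawing the vector of sample means.

```python
import math
import random

def norm(v):
    """Euclidean distance of the vector v from the origin."""
    return math.sqrt(sum(c * c for c in v))

def prob_xbar_longer(theta, n=1, reps=20_000, seed=2):
    """Monte Carlo estimate of P(|Xbar| > |theta|) for a fixed vector theta."""
    rng = random.Random(seed)
    sd = 1 / math.sqrt(n)        # each coordinate of Xbar is N(theta_i, 1/n)
    target = norm(theta)
    hits = 0
    for _ in range(reps):
        xbar = [rng.gauss(t, sd) for t in theta]
        if norm(xbar) > target:
            hits += 1
    return hits / reps

p_two_dim = prob_xbar_longer([0.0, 5.0])
p_ten_dim = prob_xbar_longer([1.0] * 10)
```

Both estimates should exceed 0.5, and the excess grows with the dimension, because the random errors add on average a positive amount to the squared length of the mean vector.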
c) The explanation of the paradox


It is impossible to show all the discrepancies between the confidence and
fiducial approaches in connection with Stein's problem. Here we restrict
ourselves to a solution proposed by Stein himself. If the fiducial approach
is applied not to the sample elements given by their coordinates but
(because of the rotational symmetry of the normal distribution) to the
sum of squared coordinates, then fiducial intervals become equivalent to
confidence intervals (see Stein's paper). Consequently, it is more
advantageous to bet on "$|\bar X|$ is greater than $|\vartheta|$".


d) Remarks 


(i) Let us construct an interval estimate for the unknown expected
value $\vartheta$ of a normal distribution with known standard deviation
$\sigma$, using the a priori information that $\vartheta$ is normally
distributed with an expected value of $\mu$ and standard deviation of $s$
($\mu$ and $s$ are known). If the mean of the n-sized sample is $\bar X$, then,
according to Bayes' theorem, the a posteriori distribution of $\vartheta$ is
also normal with the expected value

$$\vartheta^* = \mu + C(\bar X - \mu)$$

and standard deviation $D$, where

$$C = \frac{n/\sigma^2}{1/s^2 + n/\sigma^2} \quad\text{and}\quad D^2 = \frac{1}{1/s^2 + n/\sigma^2}.$$

Therefore $(\vartheta^*-2D,\ \vartheta^*+2D)$ is a 95% interval estimate for
$\vartheta$, because $P(\vartheta^*-2D<\vartheta<\vartheta^*+2D)\approx 0.95$.
The lack of a priori information means that $s=\infty$, that is, $C=1$. Thus
$\vartheta^*=\bar X$ and $D^2=\sigma^2/n$, which is just the fiducial interval.
Consequently, in the case of multidimensional normal distributions, the
Bayesian approach results in the same paradox as fiducial reasoning does
directly. Another paradox of this type is the following (it comes from the
Moscow statistical school). A machine consists of $m$ components connected in
series; thus if the i-th component works with probability $p_i$
($i=1, 2, \ldots, m$), then the machine works with probability
$p=p_1 p_2 \cdots p_m$. Now taking samples of $n_1, n_2, \ldots, n_m$
elements, it turns out that all of them work perfectly. Using this
information, find an interval estimate of the form

$$P(p > p^*) = \alpha.$$

Surprisingly enough, the confidence interval (i.e., the variable $p^*$) does
not depend on $m$, only on

$$\min_{1\le i\le m} n_i = n_0$$

and on the corresponding probability $p_0$. At the same time, in the Bayesian
framework, the interval estimate for $p$ depends on $m$.

(ii) Fisher (1890–1962) had begun to deal with interval estimates
a bit earlier than Neyman (1894–1981). Fisher even accused Neyman,
who was then working in Poland, of appropriating and enlarging his
ideas. At that time he already had both personal and professional
conflicts with other outstanding statisticians. He hated K. Pearson
(1857–1936) and for this reason he did not publish after 1920 in
Biometrika (the leading periodical of statistics, established and edited, among
others, by Pearson). Fisher transmitted his antipathy, though lessened,
to K. Pearson's son E. S. Pearson (1895–1980) and his friend J. Neyman.
Later, Neyman became one of the leading statisticians in the USA and
their dispute turned into an Anglo-American dispute. Fisher had
never liked the idea of reducing statistical conclusions to decisions with
loss functions. (This "American" tendency in statistics was developed
by the Hungarian Abraham Wald on the basis of von Neumann's game
theory.) The strong contrast was expressed as follows: In America
(corresponding to Peirce's pragmatism) it is not important what we think
but what we do. In England it is just the contrary. Fisher, though his
reasoning is not always convincing, is one of the greatest (if not the
greatest) statisticians who has ever lived. So it is strange that he was
never made professor of statistics. He did in fact become a professor at
Cambridge University in 1943, but in genetics. He also served as president
of the Royal Statistical Society between 1952 and 1954.

(iii) We are to estimate the location parameter $\vartheta$ from the sample
values $X_1, X_2, \ldots, X_n$, distributed according to the exponential density
function $e^{\vartheta-x}$ (if $x>\vartheta$ and 0 otherwise). The estimator

$$\vartheta^* = \frac{1}{n}\sum_{i=1}^{n}(X_i-1)$$

is unbiased; its probability density function is proportional to
$(x-\vartheta+1)^{n-1}e^{-n(x-\vartheta+1)}$ for $x>\vartheta-1$. Using this
density, we can easily determine the shortest 90% confidence interval. In the
case $X_1=12$, $X_2=14$, $X_3=16$, this confidence interval is
$12.1471<\vartheta<13.8264$. On the other hand, $\vartheta$ is obviously less
than $X_1^*=\min X_i=12$. Thus the shortest 90% confidence interval lies in the
region where it is impossible for $\vartheta$ to be! Jaynes emphasizes (see the
reference below) that the Bayesian solution is the proper way to determine an
interval estimate. If the prior density is constant, the posterior density of
$\vartheta$ will be $ne^{n(\vartheta-X_1^*)}$ (if $\vartheta<X_1^*$ and 0
otherwise). The shortest posterior belt that contains 100P percent of the
posterior probability is thus $X_1^*-q<\vartheta<X_1^*$, where
$q=-n^{-1}\log(1-P)$. For the above sample values $11.23<\vartheta<12.0$. From
the "confidence" point of view, one can argue that $\vartheta^*$ is not a
sufficient statistic for $\vartheta$ while $X_1^*$ is sufficient. The shortest
confidence interval based on the sufficient statistic is the same as the
Bayesian interval above. But even if we work with $X_1^*$, it may occur that a
90% confidence interval $(-\infty, f(X_1^*))$ lies in the negative half-line
when we know (from prior information) that $\vartheta$ cannot be negative.
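
Two of the interval computations in these remarks are simple enough to be sketched in a few lines. The function names below are hypothetical. The first function evaluates the posterior mean and standard deviation of remark (i), taking $C$ as the weight of the sample mean in the usual normal-normal update (an assumption about the garbled formula). The second evaluates the Bayesian interval of remark (iii) with natural logarithms, together with the unbiased estimate, assumed here to be the sample mean minus 1.

```python
import math

def normal_posterior(xbar, n, sigma, mu, s):
    """Posterior mean and standard deviation of a normal mean with known
    sigma, under a normal prior with mean mu and standard deviation s."""
    precision = 1 / s**2 + n / sigma**2
    c = (n / sigma**2) / precision          # weight of the sample mean
    return mu + c * (xbar - mu), math.sqrt(1 / precision)

def shifted_exponential_intervals(xs, prob):
    """Unbiased estimate and shortest posterior interval (flat prior) for the
    location parameter of the density exp(theta - x), x > theta."""
    n = len(xs)
    unbiased = sum(xs) / n - 1
    x_min = min(xs)
    q = -math.log(1 - prob) / n
    return unbiased, (x_min - q, x_min)

# A nearly flat prior (huge s) reproduces the fiducial values xbar and sigma/sqrt(n).
flat_mean, flat_sd = normal_posterior(xbar=10.0, n=25, sigma=2.0, mu=0.0, s=1e9)

# The sample of remark (iii).
theta_star, (low, high) = shifted_exponential_intervals([12, 14, 16], 0.9)
```

For the sample 12, 14, 16 the unbiased estimate is 13, which already exceeds the smallest observation, whereas the posterior interval ends exactly at 12.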
10. THE PARADOX OF TESTING A HYPOTHESIS
a) The history of the paradox


It is very difficult to say anything definite about the first attempts to test
a statistical hypothesis, but in his monograph B. V. Gnedenko states that
in ancient China as early as 2238 BC censuses showed that the birth rate
of boys was 50%. John Arbuthnot (1667–1735), an English mathematician,
doctor and writer, was the first to point out (in 1710) that the hypothesis
of equality of the birth rates of boys and girls must be rejected,
since, according to the demographic data over an 82-year period (available
at that time), more boys than girls were born each year. If the probability
that a newborn baby is a boy were 1/2, the experience of 82 years would
be so improbable ($(1/2)^{82}$) that it can be considered almost impossible. So
Arbuthnot was the first to reject a natural statistical hypothesis.
This (not mathematical) paradox aroused the interest of Laplace. In
1784 he was surprised to find that the birth rate of boys was approximately
equal to 22/43 in several different places, whereas the same ratio was
25/49 in Paris. Laplace was intrigued by such a remarkable difference,
but he shortly found a rational explanation: the total number of births
in Paris included all foundlings, and the surrounding population had a
preference for abandoning infants of one sex. When Laplace eliminated
the foundlings from the total number of births, the birth rate of boys
came close to the number 22/43.

In 1734, D. Bernoulli won a prize from the French Academy for an
essay on the orbits of planets. Constructing a hypothesis test, Bernoulli
attempted to show that the similarity of planetary orbital planes would
have been most unlikely to have occurred by chance. Using the right-hand
rule, each orbit corresponds to a point on a unit sphere, and he
tested the hypothesis that these points were drawn from a uniform
distribution on the unit sphere. In 1812, Laplace analyzed a similar problem.
He attempted to apply statistical methods to decide which hypothesis
should be accepted: that the comets are regular members of the solar
system or that they are only "intruders". In the latter case the angles
between the orbital planes of comets and the ecliptic would be uniformly
distributed between 0 and $\pi/2$, and this was exactly the mathematical
form of Laplace's hypothesis. (He found that comets are not regular
members of the solar system.) The modern theory of testing statistical
hypotheses was initiated by K. Pearson, E. S. Pearson, R. A. Fisher and
J. Neyman.

Suppose we have to test the hypothesis that the probability distribution
of a random variable is $F$. (In the problem of Laplace, $F$ was the
uniform distribution on the interval $[0, \pi/2]$.) For this "goodness of fit"
problem K. Pearson, H. Cramér, R. von Mises, A. N. Kolmogorov, N. V.
Smirnov and others who followed them worked out several different
tests, and it became necessary to compare their efficiency. E. S. Pearson
and J. Neyman made the first move to solve the theoretical and practical
problem of finding the best decision methods. First they introduced the
notion of alternative hypothesis, which is not necessarily the opposite
of the original, null hypothesis. For example, consider a random variable
which is normally distributed with unit variance and unknown expectation;
if the null hypothesis is that "the expectation is $-1$" and the
alternative hypothesis is that "the expectation is $+1$", then the two hypotheses
obviously do not cover all the possibilities. In connection with these
simple hypotheses (where both the null hypothesis and the alternative
hypothesis contain a single distribution) Neyman and Pearson showed
in 1933 that there exists a most powerful test in the following sense. When
a statistical test is performed, two kinds of errors are possible. We may
reject the null hypothesis when it is true, making a type I error (or error
of the first kind). On the other hand, we may accept the null hypothesis
when it is false, making a type II error (or error of the second kind). A
decision method (test) based on a sample of given size is called most powerful
if, for an arbitrary fixed probability of a type I error, the probability
of a type II error is as small as possible. (If the size of the sample is given,
the sum of the probabilities of the two types of errors cannot be made
arbitrarily small. This fact is a kind of uncertainty principle in hypothesis
testing.) Suppose, for simplicity, that both distributions (in the null
hypothesis and in the alternative hypothesis) have density functions. Then,
according to the fundamental principle of Neyman and Pearson, there
is a most powerful test of the following form. Denote by $f_0$ and $f_1$ the
density functions of the sample $X=(X_1, X_2, \ldots, X_n)$ under the null
hypothesis and the alternative hypothesis, respectively. We accept the
null hypothesis if and only if

$$\frac{f_1(X)}{f_0(X)} < c,$$

where $c$ is a suitable constant. (For simplicity we suppose that the
probability of $f_1(X)/f_0(X)=c$ is 0.)
The theory of Neyman and Pearson became fundamental in testing hypotheses,
but not without paradoxes. Herbert Robbins showed in 1950 that
there is a test which is in a sense more powerful than the most powerful
test of Neyman and Pearson.


b) The paradox 


Suppose $X$ is a normally distributed random variable with expectation
$\vartheta$ and variance 1. Let the null hypothesis be $\vartheta=-1$ and the
alternative hypothesis $\vartheta=+1$. On the basis of a single sample element
$X$, the most powerful test of the null hypothesis against the alternative
hypothesis is the following: we accept it and reject the alternative hypothesis
if $X\le 0$, otherwise we reject it and accept the alternative hypothesis. In
this case the probability of both kinds of errors is approximately 16%, since

$$P(X>0 \mid \vartheta=-1) = P(X<0 \mid \vartheta=+1) =
\frac{1}{\sqrt{2\pi}}\int_1^{\infty} e^{-x^2/2}\,dx = 0.1587 \approx 0.16.$$

If we apply this test in $N$ independent cases, then for large $N$ the expected
number of false decisions is approximately $0.16N$. Since we have used
the most powerful test in each case, one might think that the average
number of false decisions cannot be smaller than $0.16N$. The following
method of Robbins shows that, paradoxically, this is not the case.
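
The figure 0.1587 is the upper tail of the standard normal distribution beyond 1. As a quick check it can be evaluated with the complementary error function of the standard library; the helper name below is only illustrative.

```python
import math

def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# X > 0 under expectation -1 means that X + 1 exceeds 1.
error_probability = normal_upper_tail(1.0)
```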

Let $\bar X$ be the average of the observations $X_1, X_2, \ldots, X_N$. Robbins'
test is the following:

if $\bar X\le -1$, then $\vartheta_i=-1$ for all $i=1, 2, \ldots, N$;

if $\bar X\ge 1$, then $\vartheta_i=+1$ for all $i=1, 2, \ldots, N$;

and finally

if $-1<\bar X<+1$, then $\vartheta_i=-1$ or $\vartheta_i=+1$,
depending on whether the inequality

$$X_i < \frac{1}{2}\log\frac{1-\bar X}{1+\bar X}$$

holds or not. This method is very surprising because it connects independent
problems. For large $N$ (e.g., for $N=100$), if the true proportion of
cases with $\vartheta_i=+1$ is 0 or 1, then Robbins' procedure decides with
100% certainty; for a proportion of 0.1 or 0.9 the probability of error (of both
types) is 7%; for a proportion of 0.2 or 0.8 the probability of false decisions is
11%; for a proportion of 0.3 or 0.7 it is 14%, and even for a proportion of 0.4
or 0.6 the percentage of errors is still smaller than the level of 16% of the
most powerful test. Robbins' method becomes less efficient than the most
powerful test only when the proportion is near 0.5.
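
The comparison can be reproduced by simulation. The sketch below is an illustration under stated assumptions: the threshold of the compound rule is taken to be $\frac{1}{2}\log\frac{1-\bar X}{1+\bar X}$ with the natural logarithm, the number of cases is made much larger than 100 so that the estimated error rates are stable, and the share of cases with $\vartheta_i=+1$ and the seed are arbitrary.

```python
import math
import random

def simple_rule(xs):
    """Most powerful test applied to each case separately."""
    return [-1 if x <= 0 else 1 for x in xs]

def robbins_rule(xs):
    """Compound rule: the common average sets the threshold for every case."""
    xbar = sum(xs) / len(xs)
    if xbar <= -1:
        return [-1] * len(xs)
    if xbar >= 1:
        return [1] * len(xs)
    threshold = 0.5 * math.log((1 - xbar) / (1 + xbar))
    return [-1 if x < threshold else 1 for x in xs]

def error_rates(share_plus=0.1, n_cases=20_000, seed=3):
    """Fractions of false decisions of the simple and of the compound rule."""
    rng = random.Random(seed)
    n_plus = round(share_plus * n_cases)
    thetas = [1] * n_plus + [-1] * (n_cases - n_plus)
    xs = [rng.gauss(t, 1.0) for t in thetas]

    def rate(decisions):
        return sum(d != t for d, t in zip(decisions, thetas)) / n_cases

    return rate(simple_rule(xs)), rate(robbins_rule(xs))

simple_error, robbins_error = error_rates()
```

With a share of 0.1 the simple rule should err in about 16% of the cases and the compound rule in roughly 7%.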


c) The explanation of the paradox 


Robbins' paradox shows that even when we have to make decisions
about accepting or rejecting products from different factories working
independently, the total number of false decisions will be smaller on average
if we do not make our decisions independently of each other.
Since this is essentially the same problem as Stein's paradox on admissible
estimators of the expectation, here we only refer to its explanation in
Section 2 and to Robbins' fundamental paper.



d) Remark 


Other paradoxes on hypothesis testing will be discussed in Sections 12 
and 13 among the quickies. 
11. RÉNYI'S PARADOX OF INFORMATION THEORY


a) The history of the paradox 


One of the main tasks of information theory is to measure the amount
of information. The pioneers of this discipline of mathematics (C. Shannon,
N. Wiener and others) realized that the amount of information is
measurable by a scalar independent of the actual meaning and form of
the information, just as the volume of a liquid is independent of its shape.
The unit of information is the information content of an answer "yes" or "no".
In binary code this information can be given by a single digit (e.g., 1
for "yes", 0 for "no"), which is called a bit (abbr. of binary digit). What
makes this abbreviation especially suitable is its meaning as a normal
word. The content of information is measured by the average number
of binary digits needed to express the information. If a random
variable can take only a finite or countably infinite number of values
with positive probabilities $p_1, p_2, \ldots$ then, according to Shannon's
formula, its information content is

$$H = H(p_1, p_2, \ldots) = \sum_i p_i \log\frac{1}{p_i} \quad\text{(bit)}$$

where (in this passage) log stands for logarithms with base 2. $H$ is called
the entropy of the probability distribution $p_1, p_2, p_3, \ldots$. This is the
average length of the most economical code combinations by means
of which the outcomes of events with probabilities $p_1, p_2, \ldots$ can be
described. Another important notion of information theory is the
information gain. If the observation of a random variable (or event) changes
the probability distribution $p_1, p_2, p_3, \ldots$ to $q_1, q_2, q_3, \ldots$ then the
amount of the information gained is

$$\sum_k q_k \log\frac{q_k}{p_k}.$$
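
Both quantities are easy to compute for finite distributions. The sketch below assumes that the information gain is the usual sum $\sum_k q_k\log_2(q_k/p_k)$ (the Kullback-Leibler form); the function names and the example distributions are illustrative only.

```python
import math

def entropy(ps):
    """Shannon entropy in bits of a finite distribution given as a list."""
    return sum(p * math.log2(1 / p) for p in ps if p > 0)

def information_gain(ps, qs):
    """Information gained, in bits, when the distribution ps changes to qs
    (assumed form: sum of q * log2(q / p))."""
    return sum(q * math.log2(q / p) for p, q in zip(ps, qs) if q > 0)

one_bit = entropy([0.5, 0.5])                             # a yes/no answer
two_bits = entropy([0.25] * 4)                            # four equally likely outcomes
gain = information_gain([0.5, 0.5], [1.0, 0.0])           # learning the answer for sure
```

A fair yes/no question carries one bit, and learning its answer with certainty gains exactly that one bit.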
Now let the unknown parameter $\vartheta$ of a probability distribution be a
random variable (according to the Bayesian view of mathematical statistics).
For simplicity, suppose that $\vartheta$ can take only a finite or countably
infinite number of values with probabilities $p_1, p_2, p_3, \ldots$. Thus the
entropy of $\vartheta$ is $H(\vartheta)=H(p_1, p_2, p_3, \ldots)$. Suppose moreover
that the random sample $X=(X_1, X_2, \ldots, X_n)$ can also take only a finite or
countably infinite number of different values with positive probabilities
$q_1, q_2, q_3, \ldots$. Finally, let $r_{jk}$ denote the probability that
$\vartheta$ takes the j-th value (whose probability is $p_j$) and at the same
time $X$ takes the k-th value (whose probability is $q_k$). Then the amount of
information concerning $\vartheta$ obtained by observing $X$ is

$$I(X, \vartheta) = \sum_{j,k} r_{jk}\log\frac{r_{jk}}{p_j q_k}.$$

A function $f(X)=f(X_1, X_2, \ldots, X_n)$ of the sample $X$ is called sufficient
if $I(f(X), \vartheta)=I(X, \vartheta)$, i.e., if $f(X)$ contains as much
information concerning $\vartheta$ as the original sample $X$ does. If $f$ is not
necessarily sufficient, the ratio $I(f(X), \vartheta)/I(X, \vartheta)$ gives the
proportion of information concerning $\vartheta$ that can be obtained from the
sample if $f(X)$ is used instead of the complete sample. The property that, by
taking more and more observations, we can obtain at last all the information
concerning $\vartheta$ can be expressed in the language of information theory as
follows. If the observations $X_1, X_2, \ldots$ are independent, identically
distributed random variables whose distribution $F_\vartheta$ is different for
different values of the parameter $\vartheta$, then

$$\lim_{n\to\infty} I((X_1, X_2, \ldots, X_n), \vartheta) = H(\vartheta).$$

A. Rényi's paradox discussed below comes from the application of
information theory in testing hypotheses.


b) The paradox 


By observing the random variable $X$, which is connected with the event
$A$, we would like to guess whether $A$ has occurred or not. If the probability
of $A$ is $P(A)=p$, then the information content of the event $A$ is
$H(p, 1-p)$. Having observed the variable $X$, the amount of information
still missing is $H_X=E(H(P(A|X), 1-P(A|X)))$, where $P(A|X)$ stands
for the conditional probability of the event $A$ given $X$. Consequently,
the content of information concerning $A$ if $X$ was observed is

$$I(X, A) = H(p, 1-p)-H_X.$$

Observing $X$, let $d(X)=1$ if we decide that $A$ has occurred and $d(X)=0$
if we decide for the complement of $A$, i.e., $\bar A$. The probability of a
wrong decision (error) is

$$\delta = pP(d(X)=0 \mid A)+(1-p)P(d(X)=1 \mid \bar A).$$

It is easy to prove (e.g., by the fundamental result of Neyman and Pearson;
see II. 10.) that no decision can have less error $\delta$ than the following
"standard decision":

$$d_0(X) = \begin{cases} 1 & \text{if } P(A|X) > P(\bar A|X), \\ 0 & \text{if } P(A|X) < P(\bar A|X). \end{cases}$$

If $P(A|X)=P(\bar A|X)$ then let $d_0(X)=1$ with probability $p$ and 0 with
probability $1-p$. The paradox appearing here is the following. Let
$Y=d_0(X)$. In this case the information content of $Y$ concerning $A$ is

$$I(Y, A) = H(p, 1-p)-H_Y.$$

$Y$ is a function of $X$, therefore $I(Y, A)\le I(X, A)$. The equality holds if
and only if $P(A|X)$ can take only two different values, i.e., generally, $X$
contains more information concerning $A$ than $Y$. Still, knowing $X$, we
cannot make a better decision concerning $A$ than if we know only
$Y=d_0(X)$. From this it follows that while $X$ generally contains more
information on $A$ than $Y$, it is impossible to utilize this extra information.
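
A small numerical example may make the paradox tangible. It is a sketch under invented assumptions: $X$ takes three equally likely values for which $P(A|X)$ equals 0.9, 0.6 and 0.1, there are no ties, and the helper names are hypothetical. The code computes the information about $A$ and the smallest attainable error, first from $X$ and then from the standard decision $Y=d_0(X)$.

```python
import math

def h2(p):
    """Binary entropy H(p, 1 - p) in bits."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def info_and_error(cells):
    """cells: list of (probability of the cell, P(A | cell)).
    Returns (information about A in bits, error of the best decision)."""
    p_a = sum(w * pa for w, pa in cells)
    missing = sum(w * h2(pa) for w, pa in cells)
    error = sum(w * min(pa, 1 - pa) for w, pa in cells)
    return h2(p_a) - missing, error

def coarsen(cells):
    """Merge the cells of X according to Y = d0(X) (no ties assumed)."""
    merged = []
    for decide_a in (True, False):
        group = [(w, pa) for w, pa in cells if (pa > 0.5) == decide_a]
        weight = sum(w for w, _ in group)
        if weight > 0:
            merged.append((weight, sum(w * pa for w, pa in group) / weight))
    return merged

x_cells = [(1 / 3, 0.9), (1 / 3, 0.6), (1 / 3, 0.1)]
info_x, err_x = info_and_error(x_cells)
info_y, err_y = info_and_error(coarsen(x_cells))
```

In this example $Y$ carries strictly less information about $A$ than $X$ does, yet the smallest attainable error is 0.2 in both cases.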


c) The explanation of the paradox 


The extra information can be utilized by observing another random variable.
Let, e.g., $Z=X+U$, where $U$ is the indicator variable of the event
$A$. This means that $U=1$ if $A$ occurs and zero otherwise. Obviously,
by observing $X$ and $Z$ simultaneously, we get full information concerning
$A$; that is, the latent extra information in the value of $X$ concerning $A$
can be released by observing the auxiliary variable $Z$.


d) Remarks 


Information theory is in close connection with several practical problems, 
e.g., with the optimal methods of telecommunication or the foundations 
of biology (see the references below). 
12. THE PARADOX OF STUDENT'S t-TEST
a) The history of the paradox 


In the classical theory of mathematical statistics the sample elements
(observations) were considered to be given in advance. One of the most
important branches of modern statistics is based on the recognition that
the sample size should not be fixed in advance; instead it should depend
on the result of earlier observations. Thus the sample size also depends
on chance. This idea of sequential sampling evolved gradually from the
results of H. F. Dodge and H. G. Romig (1929), P. C. Mahalanobis
(1940), H. Hotelling (1941) and W. Bartky (1943), but the real founder of
the sequential theory of mathematical statistics was A. Wald (1902–1950).
His sequential likelihood ratio test (1943) was a decisive discovery which
enabled (in typical cases) a 50% saving in the average number of observations
(with the same probabilities of errors). No wonder Wald's
discovery was classified as "restricted" during World War II. His
fundamental book "Sequential Analysis" was published only in 1947. A year
later Wald and J. Wolfowitz proved that no other method can save
more sample elements than the sequential likelihood ratio test. Paradoxes
found their way into this field, too. Here we shall discuss the paradox
noted by C. Stein, though it refers only to a two-stage decision, not a
sequential one.


b) The paradox 


Let $X_1, X_2, \ldots, X_n$ be a sample of independent, normally distributed
random variables with the same unknown expectation $\vartheta$ and the same
unknown standard deviation $\sigma$. On the basis of this sample, we want to
decide between the following null hypothesis and alternative hypothesis.
The null hypothesis states that $\vartheta=\vartheta_0$ (where $\vartheta_0$ is a
given number), while the alternative hypothesis states that
$\vartheta\ne\vartheta_0$. Let

$$\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
D^{*2} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar X)^2$$

and

$$t_n = \frac{\sqrt{n}\,(\bar X-\vartheta_0)}{D^*}.$$

The usual way of making a decision between the two hypotheses is
Student's t-test. According to the t-test, the null hypothesis should be
accepted or rejected depending on whether the value of $t_n$ is near enough
to 0 or not. G. B. Dantzig showed in 1940 that, given the probability of
the type I error, the probability of the type II error depends on the unknown
standard deviation $\sigma$ for any decision method. Paradoxically, five
years later C. Stein proved that if the sample size $n$ is not fixed in advance
but depends on the sample elements which have already been
chosen (as in Wald's sequential analysis), then there does exist a t-test
where (given the probability of the type I error) the probability of the
type II error does not depend on the unknown standard deviation $\sigma$
(it depends only on the difference $\vartheta-\vartheta_0$).


c) The explanation of the paradox 


In the first step choose a sample $X_1, X_2, \ldots, X_{n_0}$, where $n_0$ is a
fixed number. The empirical sample variance is

$$s^2 = \frac{1}{n_0 - 1}\left(\sum_{i=1}^{n_0} X_i^2 - \frac{1}{n_0}\Big(\sum_{i=1}^{n_0} X_i\Big)^2\right).$$

Let the size $n$ of the entire sample depend on the magnitude of $s$ and on
a previously fixed number $z$, in the following way:

$$n = \max\left(\left[\frac{s^2}{z}\right] + 1,\; n_0 + 1\right),$$

where the brackets [ ] denote the integer part of a real number. Choose
the positive numbers $a_1, a_2, \ldots, a_n$ such that

$$\sum_{i=1}^{n} a_i = 1, \qquad a_1 = a_2 = \ldots = a_{n_0}$$

and

$$s^2 \sum_{i=1}^{n} a_i^2 = z,$$

and decide between the hypotheses on the basis of the following statistic:

$$t = \frac{\sum_{i=1}^{n} a_i (X_i - \vartheta_0)}{\sqrt{z}}.$$

Obviously, if $s$ is given, the random variable $t'$ is normally distributed
with expectation 0 and variance

$$\sigma^2 \sum_{i=1}^{n} a_i^2 / z = \sigma^2 / s^2,$$

where

$$t' = \frac{\sum_{i=1}^{n} a_i (X_i - \vartheta)}{\sqrt{z}}.$$

On the other hand, the distribution of $(n_0 - 1)s^2/\sigma^2$ (for arbitrary
$\sigma$) is the same as the distribution of the sum of squares of $n_0 - 1$
independent standard normal random variables (the so-called chi-square
distribution), which is independent of $\sigma$. Therefore the distribution
of $t'$ is also independent of $\sigma$, so the distribution of $t$ depends
only on $\vartheta - \vartheta_0$ and not on $\sigma$.
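The two-stage construction can be checked numerically. The following is a minimal sketch, not Stein's own presentation: the function names, the choice $n_0 = 5$, $z = 1$ and the particular solution for the weights (one common value on the first stage, another on the second, not forced to be positive) are assumptions made for illustration. It estimates the probability that $|t'| > 2$ for two very different values of $\sigma$; under the argument above both should be close to the corresponding tail of Student's distribution with $n_0 - 1 = 4$ degrees of freedom (about 0.116).

```python
import math
import random


def stein_statistic(theta, sigma, n0, z, rng):
    """Return t' = sum(a_i * (X_i - theta)) / sqrt(z) for one two-stage sample."""
    first = [rng.gauss(theta, sigma) for _ in range(n0)]
    mean0 = sum(first) / n0
    s2 = sum((x - mean0) ** 2 for x in first) / (n0 - 1)

    # Total sample size as in the text: n = max([s^2 / z] + 1, n0 + 1).
    n = max(math.floor(s2 / z) + 1, n0 + 1)
    m = n - n0                      # size of the second stage
    c = z / s2                      # required value of sum(a_i^2)

    # Assumed weight pattern: a on the first n0 elements, b on the rest,
    # solved from n0*a + m*b = 1 and n0*a^2 + m*b^2 = c.
    a = (1.0 + math.sqrt(m * (n * c - 1.0) / n0)) / n
    b = (1.0 - n0 * a) / m

    second = [rng.gauss(theta, sigma) for _ in range(m)]
    weighted = a * sum(x - theta for x in first) + b * sum(x - theta for x in second)
    return weighted / math.sqrt(z)


def tail_fraction(sigma, reps=5000, seed=0):
    """Empirical frequency of |t'| > 2 with n0 = 5 and z = 1 (illustrative values)."""
    rng = random.Random(seed)
    hits = sum(abs(stein_statistic(0.0, sigma, 5, 1.0, rng)) > 2.0 for _ in range(reps))
    return hits / reps


if __name__ == "__main__":
    print(tail_fraction(1.0), tail_fraction(5.0))
```

The two printed frequencies should agree up to simulation noise, although the second sample is drawn with a standard deviation five times larger.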


d) Remarks 


(i) The random variable $t_n$ is not normally distributed because $D^*$ is
not a number, but a random variable. (If the value of the standard
deviation were known and we substituted this value for $D^*$, then $t_n$
would have a standard normal distribution.) This remarkable observation
and the analysis of the random variable $t_n$ were published in 1908 by
Student, alias William S. Gosset. (He worked for the Guinness brewery in
Dublin from 1899 and his boss insisted that Gosset should write under a
pseudonym.) For a long time nobody recognized the importance of Student's
paper. (According to Student, even as late as 1922 R. A. Fisher was the
only person who ever applied the t distribution; in fact it was Fisher who
denoted Student's distribution by $t$ for the first time in his book
published in 1925. Student himself used the letter $z$ to denote, not
exactly $t_n$, but $t_n/\sqrt{n-1}$.)
(ii) The optimal stopping of sampling sequences in sequential analysis
was the root of the modern theory of optimal stopping of different pro- 
cesses. If we consider sampling as a process, we connect mathematical 
statistics and the theory of stochastic processes, which will be discussed 
in the following chapter. This connection proved an advantage for both 
areas. Nowadays Wald's fundamental theorems in sequential analysis 
are special cases of the general theory of stopped stochastic processes 
(see the book by Shirjaev). 
13. QUICKIES
a) The paradox of the typical and average 


The notion of average, e.g., the average salary, is often used as a synonym
of typical. As a matter of fact, if there are only a few extremely rich and
a great many poor families in a certain country, with correspondingly
enormous or small incomes, then the arithmetic mean of their incomes
is not at all typical. The median of incomes, e.g., gives a much more
realistic picture. (The median means that just the same number of people
have incomes larger than the median as smaller.) Besides average salary
there are other misleading averages. One of these is the “average man”
(l'homme moyen). It is no wonder that the Belgian L. A. J. Quételet’s
study on this subject became the source of stormy debates. The worst
thing about the “average man” is not his greyness but the discrepancies
that arise. E.g., the average height does not correspond to the average
weight, etc. For this reason alone we have to doubt the truth of the words
of J. Reynolds (the first president of the Royal Academy of Arts) when
he said that the source of beauty is the average.
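The gap between "average" and "typical" is easy to see on a small example. The incomes below are hypothetical numbers chosen only for illustration (99 poor families and one extremely rich one); they are not taken from the text.

```python
from statistics import mean, median

# Hypothetical incomes: 99 poor families and one extremely rich family.
incomes = [20_000] * 99 + [10_000_000]

average_income = mean(incomes)     # pulled far upward by the single outlier
typical_income = median(incomes)   # half of the families earn less, half more

print(average_income, typical_income)
```

Here the arithmetic mean is about six times the income of 99 out of 100 families, while the median coincides with it.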


(Ref.: Quételet, L. A. J., Essai de Physique sociale, (1835); L'homme moyen, Phy- 
sique Sociale, Vol. 2, Bruxelles, 1869.) 


In spite of its inconsistencies, Quételet's book of 1835 is considered
a milestone, if not the starting point, in the quantitative analysis of
human social properties. F. Galton, K. Pearson and F. Edgeworth all
appreciated Quételet as the ingenious pioneer of regression-type ideas.
It was due to his book that Galton began his scientific research.
Quételet, however, had other scientific merits, too. In 1820 he founded
the Royal Belgian Observatory and became its first director. He was an
excellent organizer too: the Statistical Society in London was set up at
his suggestion in 1834, and it was also he who suggested that the first
International Congress of Statistics should be convened in Brussels in 1853.


b) The paradox of estimation 


The square of an estimate is generally not the same as the estimate of the
square. If, e.g., a parameter is estimated by $\bar{X}$, that is, the mean
of the observed values $X_1, X_2, \ldots, X_n$, then the obvious estimate of
the square of the parameter is $\bar{X}^2$, which generally differs from
the mean of the squares of the observed values. The same is true if the
square is replaced by any nonlinear function.
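A three-element sample already shows the difference. The observed values below are hypothetical and serve only as an illustration; the gap between the two quantities equals the empirical variance (with divisor $n$) of the data.

```python
from statistics import mean

observations = [1.0, 2.0, 3.0]   # hypothetical observed values

square_of_mean = mean(observations) ** 2                  # estimate first, then square
mean_of_squares = mean([x ** 2 for x in observations])    # square first, then average
gap = mean_of_squares - square_of_mean                    # variance of the data (divisor n)

print(square_of_mean, mean_of_squares, gap)
```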


(Ref.: Carnap, R., Logical Foundation of Probability, Routledge and Kegan Paul Ltd. 
Broadway House, London, 1950.) 


c) The paradox of accurate measurement 


Our task is, e.g., to determine the length of two different rods by two
measurements. The instrument we may use measures length with a random
error whose standard deviation is $\sigma$. Paradoxically, the best method
is not measuring the rods one by one. The standard deviation of the result
will be less if first the total length ($T$) is measured by putting the
rods end to end, and then the rods are put side by side and the difference
of their lengths ($D$) is measured. The approximate lengths of the rods are

$$\frac{T+D}{2} \quad \text{and} \quad \frac{T-D}{2},$$

respectively. The standard deviation of these lengths is $\sigma/\sqrt{2}$,
which is really less than $\sigma$.
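The value $\sigma/\sqrt{2}$ follows in one line, under the assumption (not stated explicitly in the text) that the errors of the two measurements are independent and both have variance $\sigma^2$:

```latex
\operatorname{Var}\!\left(\frac{T \pm D}{2}\right)
  = \frac{\operatorname{Var}(T) + \operatorname{Var}(D)}{4}
  = \frac{\sigma^2 + \sigma^2}{4}
  = \frac{\sigma^2}{2},
\qquad\text{so the standard deviation is } \frac{\sigma}{\sqrt{2}}.
```

Each rod thus benefits from both measurements, whereas measuring the rods one by one uses only a single measurement per rod.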


(Ref.: Hotelling, H., “Some improvements in weighing and other experimental 
techniques", Annals of Math. Statist., 15, 297—306, (1944).) 


d) The paradoxical estimation of probability 


The usual estimation of an unknown probability is the relative frequency. 
For example, if we toss a coin a hundred times and obtain tails 47 
times then the probability of tossing tails is estimated at 47/100. However, 
if we toss a more or less fair coin 10 times but do not obtain any tails, 
it is unreasonable to consider the probability of tails to be 0. If we have 
some a priori information (e.g., the coin is more or less fair) then 
estimating by the relative frequency is generally not the best method. Our 
a priori information can be well expressed by the beta distribution
depending on two parameters $a$ and $b$. (The density function of the beta
distribution is 0 outside the interval (0, 1) and proportional to
$x^{a-1}(1-x)^{b-1}$ on (0, 1); $a > 0$, $b > 0$.) The expected value and
the variance of the beta distribution are

$$m = \frac{a}{a+b} \quad \text{and} \quad d^2 = \frac{ab}{(a+b)^2(a+b+1)},$$

respectively.

Thus, solving this system of equations, our a priori information concerning
$m$ and $d$ can be expressed by $a$ and $b$ (e.g., if the coin is fair then
$m = 1/2$, thus $a = b$). If the a priori distribution is beta with
parameters $(a, b)$ then, by Bayes' theorem, the a posteriori distribution
will be of beta type, too. (This property makes the beta distribution
widely applicable.) If an event with unknown probability occurs $k$ times
out of $n$ experiments then the parameters of the a posteriori beta
distribution will be $(a+k, b+n-k)$, thus the a posteriori expected value is

$$M = \frac{a+k}{a+b+n},$$

which contains more information and is a better estimate of the unknown
probability than the relative frequency $k/n$. Of course, if $n$ is large
enough then $M$ hardly differs from the relative frequency, but, e.g., in
case $n = 10$, $k = 0$ and $a = b = 100$, we get

$$M = \frac{100}{210} \approx 0.48,$$

whereas the relative frequency is 0, which is absolute nonsense.
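The a posteriori expected value is a one-line computation. The sketch below uses exact fractions and reproduces the numerical example of the text; the function name `posterior_mean` is an illustrative choice.

```python
from fractions import Fraction


def posterior_mean(a, b, k, n):
    """Mean of the Beta(a + k, b + n - k) posterior: (a + k) / (a + b + n)."""
    return Fraction(a + k, a + b + n)


# The example of the text: no tails in 10 tosses, prior Beta(100, 100).
M = posterior_mean(a=100, b=100, k=0, n=10)
relative_frequency = Fraction(0, 10)

print(float(M), float(relative_frequency))
```

With a weak prior such as $a = b = 1$ the same ten tosses give $1/12$, so the strength of the a priori information is what keeps $M$ near $1/2$.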


(Ref.: Good, I. J., The Estimation of Probability, MIT Press, Cambridge, 1965.) 


e) The more the data the worse the conclusions 


Quite obviously, more data enables us to calculate better results. The
following paradox, however, seems to show just the contrary. Let $X_1$,
$X_2$ and $X_3$ denote independent random variables and suppose that the
distributions of $X_1$ and $X_2$ are the same: both $X_1$ and $X_2$ are
equal either to 0 or to 2 with the same probability, hence both have the
same expected value, namely 1. Let further $X_3$ be equal either to 1 or to
2.5 with equal probability, so its expected value is 1.75. All this
information is unknown to a mathematician who takes samples from these
distributions in order to select the one with the greatest expected value.
The most obvious choice is the distribution whose sample mean is the
greatest. Take first a sample of a single element from every distribution.
The probability of the correct selection is then

$$P(X_1 < X_3 \text{ and } X_2 < X_3) = \frac{5}{8}.$$

Now what happens if one of the sample sizes (e.g., that of $X_3$) is
increased to 2 (the others remain unchanged)? The probability of the
correct selection is then

$$P(X_1 < \bar{X}_3 \text{ and } X_2 < \bar{X}_3) = \frac{7}{16},$$

where $\bar{X}_3$ is the arithmetic mean of the two elements of the sample.
Thus the probability of the correct selection has decreased from more than
50% to less than 50%.
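Both probabilities can be verified by enumerating the equally likely outcomes. The sketch below does this with exact fractions; the function name and the data layout are illustrative choices.

```python
from fractions import Fraction
from itertools import product

X12_VALUES = [0, 2]                          # X_1 and X_2: 0 or 2, each with probability 1/2
X3_VALUES = [Fraction(1), Fraction(5, 2)]    # X_3: 1 or 2.5, each with probability 1/2


def correct_selection_probability(sample_size_3):
    """P(X_1 < mean of the X_3 sample and X_2 < mean of the X_3 sample)."""
    outcomes = list(product(X12_VALUES, X12_VALUES, *[X3_VALUES] * sample_size_3))
    favourable = 0
    for x1, x2, *x3_sample in outcomes:
        x3_mean = sum(x3_sample) / sample_size_3
        if x1 < x3_mean and x2 < x3_mean:
            favourable += 1
    return Fraction(favourable, len(outcomes))


print(correct_selection_probability(1), correct_selection_probability(2))
```

The loss comes from the intermediate value 1.75 of the two-element mean: it appears with probability 1/2 and still fails to exceed the value 2 of $X_1$ or $X_2$.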


(Ref.: Chius, W. K., Lam, K., The American Statistician, 1975.) 


F. Y. Edgeworth's famous paradox (1883) concerns a similar problem:
if $X_1$ and $X_2$ are independent random variables with the same density
function $f(x - \vartheta)$ symmetrical about $\vartheta$, then it could
easily happen that $X_1$ is closer to $\vartheta$ than
$\bar{X} = (X_1 + X_2)/2$ in the sense that

$$P(|X_1 - \vartheta| < \varepsilon) > P(|\bar{X} - \vartheta| < \varepsilon)$$

for any positive $\varepsilon$. This is the case, e.g., when

$$f(x) = \frac{3}{2}\,\frac{1}{(1 + |x|)^4},$$

because the density function of $X_1$ is greater, at the point $\vartheta$,
than that of $\bar{X}$.


(Ref.: Stigler, S. M., “An Edgeworth curiosum”, Annals of Statist. 8, 931—934, 
(1980).) 


f) The paradox of equality of expected values 


Let the expected values of three normal random variables with the same
variance be $m_1$, $m_2$ and $m_3$. It can happen that, applying, e.g.,
Student's t-test, we accept the hypotheses $m_1 = m_2$ and $m_2 = m_3$ (at
a certain confidence level) but reject $m_1 = m_3$! (The problem of
equality of expected values is tested on the basis of two n-sized samples
by the statistic

$$t = \frac{\bar{X} - \bar{Y}}{D^* \sqrt{2/n}},$$

where $\bar{X}$ and $\bar{Y}$ are the sample means and $D^*$ is the
empirical standard deviation. The statistic $t$ has a Student's
distribution with the parameter $2n - 2$.) This paradox was the starting
point of much research on simultaneous testing (analysis of variance, etc.).


(Ref.: Hodges, J. L., Lehmann, E. L., “The efficiency of some nonparametric
competitors of the t-test”, Annals of Math. Statist., 27, 324–335, (1956).
Scheffé, H., The Analysis of Variance, Wiley, New York, 1959.)


g) A paradoxical estimation for the expectation of a normal distribution 


We wish to estimate the unknown expected value $\vartheta$ of a
(one-dimensional) normal distribution with unit standard deviation from an
n-sized sample. It is known that the arithmetic mean $\bar{X}$ of the
sample is an estimator with many favourable properties. It is, e.g.,
unbiased, has minimal variance, and is admissible and minimax under the
quadratic loss function. In spite of these properties, if our aim is only
to give as close an estimator for $\vartheta$ as possible with given
probability, then there exists a better estimator $\hat{\vartheta}$, i.e.:

$$P(|\hat{\vartheta} - \vartheta| < |\bar{X} - \vartheta|) > \frac{1}{2}$$

for any possible $\vartheta$. This type of estimator is the following:

$$\hat{\vartheta} = \bar{X} - \frac{1}{2\sqrt{n}} \min\left(\sqrt{n}\,\bar{X},\; \Phi(-\sqrt{n}\,\bar{X})\right)$$

if $\bar{X} \ge 0$ and

$$\hat{\vartheta} = \bar{X} + \frac{1}{2\sqrt{n}} \min\left(\sqrt{n}\,|\bar{X}|,\; \Phi(-\sqrt{n}\,|\bar{X}|)\right)$$

if $\bar{X} < 0$, where $\Phi$ denotes the distribution function of the
standard normal distribution.
h) A paradox on testing normality


We want to test whether a given sample $X_1, X_2, \ldots, X_n$ may or may
not come from a distribution with continuous distribution function $F(x)$.
Let

$$F_n(x) = \frac{1}{n} \sum_{i:\, X_i < x} 1$$

denote the empirical (cumulative) distribution function of the sample.
According to Kolmogorov’s theorem, if the hypothesis is true then

$$\lim_{n \to \infty} P\left(\sqrt{n}\, \sup_x |F_n(x) - F(x)| < z\right) = \sum_{j=-\infty}^{\infty} (-1)^j e^{-2j^2z^2} = K(z).$$

Using this theorem, it is easy to construct a test with confidence level
$\alpha$. (If $\sqrt{n}\, \sup_x |F_n(x) - F(x)|$ exceeds a critical value
$z_0$ for which $K(z_0) = \alpha$ then the hypothesis will be rejected.) If
the normality of a probability distribution is to be tested, then first the
expected value and the standard deviation of the hypothetical normal
distribution should be estimated from the sample by the usual $\bar{X}$ and
$D^*$. Secondly, the above Kolmogorov test should be applied for a normal
distribution $F(x)$ with the expected value $\bar{X}$ and the standard
deviation $D^*$. Then we might think that if $n$ is large enough the
substitution of the unknown parameters by $\bar{X}$ and $D^*$ does not
cause any essential difference. The difference is, however, significant.
E.g., at a 95% confidence level the critical value $z_0$ in Kolmogorov's
test is 1.36, while a precise analysis would show that the correct critical
value is 0.9. The explanation of this paradox is rather simple. Due to the
substitutions, $F(x)$ and the empirical $F_n(x)$ have come closer to each
other, so it is advisable to choose a smaller critical value.
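The limiting function $K(z)$ is easy to evaluate numerically, which makes the size of the discrepancy visible. The sketch below assumes the standard form of Kolmogorov's series, $K(z) = \sum_j (-1)^j e^{-2j^2z^2}$, truncated after a fixed number of terms; the function name and the truncation length are illustrative choices.

```python
import math


def kolmogorov_k(z, terms=100):
    """K(z) = sum over all integers j of (-1)^j * exp(-2 j^2 z^2), truncated to |j| <= terms."""
    if z <= 0:
        return 0.0
    # The j = 0 term is 1; the terms for j and -j are equal, hence the factor 2.
    tail = sum((-1) ** j * math.exp(-2.0 * j * j * z * z) for j in range(1, terms + 1))
    return 1.0 + 2.0 * tail


print(kolmogorov_k(1.36))   # close to 0.95, the level quoted for the critical value 1.36
print(kolmogorov_k(0.9))    # the level K assigns to the value 0.9
```

Read through $K$, the value 0.9 corresponds to a level far below 95%, so using the table of $K$ after estimating the parameters from the sample makes the test reject much too rarely.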


(Ref.: Durbin, J., “Some methods of constructing exact tests”, Biometrika, 48,
41–45, (1961).

Kac, M., Kiefer, J., Wolfowitz, J., “On tests of normality and other tests of goodness
of fit based on distance methods”, Annals of Math. Statist., 26, 189–211,
(1955).

Sarkadi, K., “On testing for normality”, Proc. 5th Berkeley Symp. on Math. Statist.
and Prob., I, 373–387, 1967.)
i) A paradox of linear regression


Suppose that a random quantity $X$ can only be measured with an error
$\varepsilon$ having an expected value of 0. In other words, the result of
the measurement $Y = X + \varepsilon$ is a very simple linear regression on
$X$. Is there a “better estimator” for $X$ than the measured $Y$?
Surprisingly, in some special cases the answer is affirmative. At least,
there exists an estimator $\hat{X}$ for which $E(\hat{X} - X)^2$ is less
than $E(Y - X)^2 = E\varepsilon^2$. Suppose, e.g., that $X$ and
$\varepsilon$ are uncorrelated and the regression function of $X$ on $Y$ is
also linear. Then

$$\hat{X} = E(X)\,\frac{D^2(\varepsilon)}{D^2(Y)} + Y\,\frac{D^2(X)}{D^2(Y)}$$

is a better estimator than $Y$. (In the extreme case of
$D^2(\varepsilon) = 0$ we get $\hat{X} = Y$.)
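A small simulation illustrates the gain. It is a sketch under assumptions that go beyond the text: $X$ and $\varepsilon$ are taken to be independent normal variables, and the numerical values (mean 10, standard deviations 2 and 3) are hypothetical. Under these assumptions the mean squared error of $Y$ is $D^2(\varepsilon) = 9$, while that of $\hat{X}$ is $D^2(X)D^2(\varepsilon)/D^2(Y) = 36/13$.

```python
import random


def mean_squared_errors(mean_x=10.0, sd_x=2.0, sd_eps=3.0, reps=20000, seed=1):
    """Return (MSE of Y, MSE of X_hat) estimated from `reps` simulated measurements."""
    rng = random.Random(seed)
    var_x, var_eps = sd_x ** 2, sd_eps ** 2
    var_y = var_x + var_eps          # X and the error are uncorrelated
    weight = var_x / var_y           # D^2(X) / D^2(Y)

    err_y = err_hat = 0.0
    for _ in range(reps):
        x = rng.gauss(mean_x, sd_x)
        y = x + rng.gauss(0.0, sd_eps)
        # X_hat = E(X) * D^2(eps)/D^2(Y) + Y * D^2(X)/D^2(Y)
        x_hat = mean_x * (1.0 - weight) + y * weight
        err_y += (y - x) ** 2
        err_hat += (x_hat - x) ** 2
    return err_y / reps, err_hat / reps


if __name__ == "__main__":
    print(mean_squared_errors())
```

The estimator shrinks the measurement towards $E(X)$, which has to be known (or estimated separately) for the method to be applicable.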


j) Sethuraman’s paradox 


There exist statistical functions $A$ and $B$ such that the unbiased
estimator of the unknown parameter $\vartheta$ based on $A$ has smaller
variance than the estimator based on $B$ (whatever the true value of
$\vartheta$); on the other hand, when testing the null hypothesis
$\vartheta = \vartheta_0$ (e.g., against the alternative hypothesis
$\vartheta > \vartheta_0$), a test based on the function $A$ is not
necessarily better than a test based on $B$; the latter can be better
locally (in a neighbourhood of the null hypothesis). If, for example, the
sample elements $X_1, X_2, \ldots, X_n$ are uniformly distributed on the
interval $(\vartheta, 2\vartheta)$, the maximum likelihood estimator of
$\vartheta$ is

$$\frac{U}{2}, \quad \text{where } U = \max(X_1, X_2, \ldots, X_n),$$

and a slight modification leads to the unbiased estimator

$$B = \frac{n+1}{2n+1}\,U.$$

The following estimator $A$ is also unbiased but with smaller variance:

$$A = \frac{n+1}{5n+4}\,(2U + V), \quad \text{where } V = \min(X_1, X_2, \ldots, X_n).$$

To test the hypothesis $\vartheta = \vartheta_0$, however, the method based
on $B$ is locally more powerful.
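The unbiasedness of both estimators and the ordering of their variances can be checked by simulation. The sketch below assumes the forms $B = \frac{n+1}{2n+1}U$ and $A = \frac{n+1}{5n+4}(2U+V)$; the values $\vartheta = 1$ and $n = 5$ and the function name are hypothetical choices for illustration.

```python
import random
from statistics import mean, pvariance


def simulate_estimators(theta=1.0, n=5, reps=20000, seed=2):
    """Return lists of simulated values of the estimators A and B."""
    rng = random.Random(seed)
    a_values, b_values = [], []
    for _ in range(reps):
        sample = [rng.uniform(theta, 2.0 * theta) for _ in range(n)]
        u, v = max(sample), min(sample)
        b_values.append((n + 1) / (2 * n + 1) * u)            # based on the maximum only
        a_values.append((n + 1) / (5 * n + 4) * (2 * u + v))  # uses maximum and minimum
    return a_values, b_values


if __name__ == "__main__":
    a_values, b_values = simulate_estimators()
    print(mean(a_values), pvariance(a_values))
    print(mean(b_values), pvariance(b_values))
```

Both sample means should be close to $\vartheta$, and the empirical variance of $A$ should be the smaller one; the simulation says nothing about the power of the tests, which is the other half of the paradox.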


(Ref.: Sethuraman, J., “Conflicting criteria of ‘goodness’ of statistics", Sankhya, 
23, 187—190, (1961).) 


k) A paradox on minimax estimation 


The notion of minimax estimation was introduced in Remark (ii) of the
Paradox I. 12. Minimax estimations usually suit common sense. The
following example of H. Rubin, however, shows the contrary. The only
minimax estimator of the unknown probability $p > 0$ is the identically
0 estimator if the loss function is $L(p, c) = \min((p-c)^2/p^2;\, 2)$. So,
no matter what the sample was, it is reasonable to estimate the unknown
parameter by 0 (a value which was ruled out in advance among the
possible values of $p$).

Remark: If the loss function is $L(p, c) = (p-c)^2$, the minimax estimator
is

$$\frac{x + \frac{\sqrt{n}}{2}}{n + \sqrt{n}},$$

where $n$ is the sample size and $x$ is the frequency of the event with
unknown probability.
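The estimator in the Remark has constant risk $1/(4(1+\sqrt{n})^2)$, whatever the value of $p$, and this can be verified exactly for a binomial sample. The sketch below is an illustration with a hypothetical sample size $n = 16$; the function names are not from the text.

```python
from math import comb, sqrt


def risk(estimator, n, p):
    """Exact quadratic risk E(estimator(x) - p)^2 for x ~ Binomial(n, p)."""
    return sum(
        comb(n, x) * p ** x * (1 - p) ** (n - x) * (estimator(x) - p) ** 2
        for x in range(n + 1)
    )


def minimax_estimator(n):
    """The estimator (x + sqrt(n)/2) / (n + sqrt(n)) of the Remark."""
    return lambda x: (x + sqrt(n) / 2) / (n + sqrt(n))


def relative_frequency(n):
    """The usual estimator x / n."""
    return lambda x: x / n


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(p, risk(minimax_estimator(16), 16, p), risk(relative_frequency(16), 16, p))
```

For $n = 16$ the constant risk is 0.01. The relative frequency has the larger risk near $p = 1/2$ but the smaller one near 0 and 1, which is exactly the trade-off a minimax criterion ignores.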


1) Robbins’ paradox 


It is well-known that the “best” estimator of the parameter of a Poisson
distribution on the basis of a single observation $X$ is just $X$. (This is
a minimum variance unbiased, maximum likelihood estimator.) But how
can we estimate the parameters $\vartheta_1, \vartheta_2, \ldots, \vartheta_k$
of $k$ independent Poisson distributions on the basis of the corresponding
observations $X_1, X_2, \ldots, X_k$ if we want

$$E\left(\sum_{i=1}^{k} (\hat{\vartheta}_i - \vartheta_i)^2\right)$$

to be minimal? Is there a better estimator than $\hat{\vartheta}_i = X_i$?
It was H. Robbins who first pointed out that, though the $k$ Poisson
distributions are independent, it is still possible to find better
estimators which take into account not only “their own” (i.e.,
corresponding) observations, but also the others. According to Robbins, if
$k$ is large and $N(X)$ denotes the number of observations which are equal
to $X$, then the estimator
$\hat{\vartheta}_i = (X_i + 1)\,N(X_i + 1)/N(X_i)$ is better than
$\hat{\vartheta}_i = X_i$. The essence of the paradox is the following: it
is possible that observations which have nothing to do with a parameter
can influence its good estimation (cf. Paradox II. 2. (ii)).
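Robbins' formula needs only the table of counts $N(\cdot)$. The following minimal sketch computes the estimates for a list of observations; the function name and the tiny data set are hypothetical, and in practice $k$ has to be large for the estimator to behave well.

```python
from collections import Counter
from fractions import Fraction


def robbins_estimates(observations):
    """Return (X_i + 1) * N(X_i + 1) / N(X_i) for every observation X_i."""
    counts = Counter(observations)   # N(x); values that never occur count as 0
    return [Fraction((x + 1) * counts[x + 1], counts[x]) for x in observations]


# Hypothetical observations: N(0) = 2, N(1) = 3, N(2) = 1, N(3) = 0.
print(robbins_estimates([0, 0, 1, 1, 1, 2]))
```

Each estimate depends on how often the neighbouring value $X_i + 1$ was observed among the other, independent variables, which is the paradoxical feature.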


(Ref.: Robbins, H., “An empirical Bayes' approach to statistics", Proc. 3rd Berkeley 
Symp. on Math. Statist. and Prob. I, 157—164, 1956.) 


m) A Bayes model paradox 


Let the density function $f(x)$ of a random variable $X$ be the mixture of
two positive density functions $f_0(x)$ and $f_1(x)$:

$$f(x) = p f_0(x) + (1-p) f_1(x), \quad \text{where } 0 \le p \le 1.$$

The value of $p$ is unknown and we hope that we can determine it as
precisely as desired if $n$ is large enough, on the basis of the
independent observations $X_1, X_2, \ldots, X_n$ (the distributions of the
$X_i$ and $X$ are the same). We wish to solve the problem using Bayes'
theorem: we choose a number $p_0$, $0 < p_0 < 1$, and assume that the a
priori density of $X$ is $p_0 f_0(x) + (1 - p_0) f_1(x)$. Then the a
posteriori density of $X$ (having observed the sample
$X_1, X_2, \ldots, X_n$) is $p_n f_0(x) + (1 - p_n) f_1(x)$, where

$$\frac{p_n}{1 - p_n} = \frac{p_0}{1 - p_0} \prod_{i=1}^{n} \frac{f_0(X_i)}{f_1(X_i)}.$$

The sample elements would really determine the value of $p$ as precisely
as we wish if

$$\lim_{n \to \infty} \frac{p_n}{1 - p_n} = \frac{p}{1 - p} \quad \text{(with probability 1)}.$$

This equation, however, does not always hold; if, e.g., the expectation of

$$\log \frac{f_0(X)}{f_1(X)}$$

is 0, then by the Chung–Fuchs theorem (cf. III. 7. b.)

$$\limsup_{n \to \infty} \frac{p_n}{1 - p_n} = \infty \quad \text{and} \quad \liminf_{n \to \infty} \frac{p_n}{1 - p_n} = 0,$$

therefore

$$\lim_{n \to \infty} \frac{p_n}{1 - p_n}$$

does not even exist in this case. The paradox vanishes if instead of
$(p_0, 1 - p_0)$ we choose an a priori distribution which has a positive
density function on the whole interval $0 < p < 1$. This model is more
advantageous since it takes the actual $f$ into consideration with positive
density.

(Ref.: Berk, R. H., “Limiting behaviour of the posterior distributions when
the model is incorrect”, Annals of Math. Statist., 37, 51–58, (1966).)


n) А paradox of confidence intervals 


Let $X_1, X_2, X_3, \ldots$ be normally distributed random variables with a
common expectation $m$ and unit variance, and let $S_n$ denote the
following sum:

$$S_n = X_1 + X_2 + \ldots + X_n.$$

The probability that for any fixed $n$

$$\frac{S_n - 2\sqrt{n}}{n} < m < \frac{S_n + 2\sqrt{n}}{n}$$

is approximately 95%, whereas the probability that the inequalities hold
for every $n$ is 0. The latter probability remains 0 even if we substitute
an arbitrarily large number for 2. (cf. Robbins, H., “Statistical methods
related to the law of the iterated logarithm”, Annals of Math. Statist., 41,
1397–1409, (1970).)
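A simulation cannot show that the limiting probability is 0, but it does show how the simultaneous coverage erodes as more values of $n$ are included. The sketch below assumes that the $X_i$ are independent (an assumption of this illustration); the horizons, the number of paths and the function names are hypothetical choices.

```python
import math
import random


def first_violation(m, horizon, rng):
    """Return the first n <= horizon with |S_n - n*m| >= 2*sqrt(n), or None if there is none."""
    s = 0.0
    for n in range(1, horizon + 1):
        s += rng.gauss(m, 1.0)
        if abs(s - n * m) >= 2.0 * math.sqrt(n):
            return n
    return None


def coverage_up_to(horizons, m=0.0, paths=2000, seed=3):
    """Fraction of simulated paths for which the inequalities hold for all n <= h, for each h."""
    rng = random.Random(seed)
    longest = max(horizons)
    violations = [first_violation(m, longest, rng) for _ in range(paths)]
    return {h: sum(v is None or v > h for v in violations) / paths for h in horizons}


if __name__ == "__main__":
    print(coverage_up_to([1, 10, 500]))
```

For the horizon 1 the fraction is close to 95%, and it can only decrease as the horizon grows, because the same paths are required to satisfy more inequalities.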
o) A paradox of testing independence; is an effective medicine effective?


The three tables below indicate the effect of a certain kind of medicine
when it was taken only by men, only by women, and by the two sexes
together (combined results). The tables show that the recovery rates are
better after medicinal treatment both among men and women. (The
significant difference can be shown statistically by independence tests.)

Figure 11. Three 2×2 tables (MEN, WOMEN, COMBINED), each with the columns
“After medical treatment” and “Without medical treatment” and the rows
“Recovered” and “Not recovered”.

On the other hand, the table of combined results indicates, surprisingly,
that the rate of recovery is better among those people who did not take
the medicine. So the medicine which proved to be effective both among
men and women gave a negative result when a mixed group of men and
women were treated with it. Similarly, a newly discovered medicine may
be found to be effective in each of ten different hospitals, but the
combined list of experiments shows the medicine to be worthless or of
negative effect. (Ref.: Pflug, G.: “Paradoxien der Wahrscheinlichkeitsrechnung”,
in: Stochastik im Schulunterricht, Wien, Teubner, 155–163, 1981.)
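The reversal is easy to reproduce with made-up numbers. The counts in the sketch below are hypothetical and are not the entries of the tables in Figure 11; they are chosen so that most treated patients belong to the group with the lower recovery rate.

```python
from fractions import Fraction

# Hypothetical counts (recovered, total) for each sex and each arm of the experiment.
DATA = {
    "men":   {"treated": (30, 100), "untreated": (2, 10)},
    "women": {"treated": (9, 10),   "untreated": (80, 100)},
}


def recovery_rate(group, arm):
    """Recovery rate within one sex for one arm."""
    recovered, total = DATA[group][arm]
    return Fraction(recovered, total)


def combined_rate(arm):
    """Recovery rate for one arm after pooling men and women."""
    recovered = sum(DATA[group][arm][0] for group in DATA)
    total = sum(DATA[group][arm][1] for group in DATA)
    return Fraction(recovered, total)


for group in DATA:
    print(group, recovery_rate(group, "treated"), recovery_rate(group, "untreated"))
print("combined", combined_rate("treated"), combined_rate("untreated"))
```

Within each sex the treated patients recover more often, yet the pooled table shows the opposite, because the pooled rates are weighted averages with very different weights in the two arms.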


p) Paradox of computer statistics 


The face of statistics has been changed by computers since the 1950's. 
Without computers scientists were forced to use oversimplified models 
even if these models were unrealistic. In the last thirty years, however, 
any statistical decision that a computer can calculate in a relatively 
short period has become “easy”. Thus many "stable" ("robust") and 
multivariate methods with an enormous quantity of operations entered 
the practice of everyday statistics. At the same time statistics has become, 
at least partly, an empirical science: computers can generate millions 
of data in a few minutes and using them we can “test” most new methods. 
Many "empirical theorems” were put into practice without firm theoret- 
ical basis. On the other hand, the theory of robust statistics (see, e.g., 
Huber, P. J., Robust Statistics, Wiley, New York, 1981) gives the theo- 
retical background for many empirical “dirty tricks" in the practice of 
statistics. For the controversies and paradoxes of this new period we 
only refer to the following outstanding papers. 


Efron, B., “Bootstrap methods: another look at the jackknife”, Annals of Statist., 7,
1–26, (1979).

Efron, B., “Computers and the theory of statistics: Thinking the unthinkable”, SIAM
Review, Oct. 1979.

Hampel, F. R., “Robust estimation: A condensed partial survey”, Zeitschrift für
Wahrscheinlichkeitstheorie und verwandte Gebiete, 27, 87–104, (1973).

Miller, R. G., “The jackknife - a review”, Biometrika, 61, 1–15, (1974).

Tukey, J. W., “The future of data analysis”, Annals of Math. Statist., 33, 1–67,
(1962).


Chapter 3 


Paradoxes of random processes 


“But next in order I will describe in what ways that assemblage of matter
which you see has established earth and sky and the ocean deeps, and the
courses of sun and moon. For certainly it was no design of the
first-beginnings that led them to place themselves each in its own order
with keen intelligence, nor assuredly did they make any bargain what
motions each should produce; but because many first-beginnings of things
in many ways struck with blows and carried along by their own weight
from infinite time unto this present, have been accustomed to move and to
meet in all manner of ways, and to try all combinations, whatsoever they
could produce by coming together, for this reason it comes to pass, that
being spread abroad through a vast time, by attempting every sort of
combination and motion, at length those come together which being
suddenly brought together often become the beginnings of great things,
of sea and sky and the generation of living creatures.”

(Lucretius, De Rerum Natura, Book V, 416–431, Trans. W. H. D. Rouse)


The first remarkable results in the theory of random processes—or sto- 
chastic processes to use a term of Greek origin—arose only in the last 
century. In the 17th and 18th centuries the chief tendency in investiga- 
tion was to examine deterministic processes—due especially to the succes- 
ses of classical mechanics. The “mechanical deterministic" aspect of 
science, which identified chance with unimportance and aimed to elimi- 
nate chance from basic sciences if possible, also evolved at that time. 
In the second half of the last century, however, the mathematics of ran- 
dom processes gained ground gradually in every fundamental branch of 
science, also in physics through statistical physics, and played an essen- 
tial part in 20th century quantum physics. As the profundity of scientific 
cognition increased, the indispensability of stochastic processes became 
more and more evident. 
1. THE PARADOX OF BRANCHING PROCESSES


a) The history of the paradox 


In the first half of the previous century an interesting phenomenon was
noticed, namely the gradual extinction of several famous common and
aristocratic family names. This problem was studied mathematically by
I. J. Bienaymé in 1845 and de Candolle in 1873. In 1874 Galton and
Watson published a paper of fundamental importance on this subject.
The branching chain of family names became the first example of the
random branching process. This type of process appeared in chemistry,
physics, and in several other areas. E.g., in nuclear physics the process
of neutron multiplication or chain reactions can be modelled as branching
processes. Neutron generations follow each other much more often than
human generations, but in both cases the main question is the same: under
what conditions will the process die out (the family name become extinct)
or increase to infinity (the bomb blow up). The notion of branching process
was coined by A. N. Kolmogorov and N. A. Dmitriev in 1947.


b) The paradox 


Let $p_0, p_1, p_2, \ldots$ denote the probability that an adult man has
0, 1, 2, ... sons. Calculate the probability $q$ that after some
generations there remain no male offspring (extinction). Let the generating
function of the probability distribution $p_0, p_1, p_2, \ldots$ be defined
by

$$g(z) = \sum_{k=0}^{\infty} p_k z^k,$$

where $|z| \le 1$. Denote the similar generating function in the nth
generation by $g_n(z)$ ($g_1(z) = g(z)$). Then, one can easily see that
$g_{n+1}(z) = g(g_n(z))$, i.e., the generating function can be obtained by
successive function iterations of $g(z)$. The probability that there remain
no male offspring in the nth generation is

$$q_n = g_n(0).$$

Since $q_n$ is a monotone increasing sequence, $\lim q_n = q$ exists and so
$q_{n+1} = g(q_n)$ implies

$$q = g(q).$$

Consequently, the probability $q$ can be calculated from this equation.
Since $q = 1$ is always a root of the equation, Watson supposed (wrongly)
that the probability of extinction is always 1 and therefore is unavoidable.
Though Watson's result is completely unbelievable, it was not until
the 1920s that R. A. Fisher, J. B. S. Haldane, J. F. Steffensen and others
showed that the equation has another root, too, which is less than 1
provided that the average number of sons to be born,

$$m = \sum_{k=0}^{\infty} k p_k,$$

is greater than 1. In this case the smaller root gives the actual
probability of extinction. On the other hand, it is no wonder that in the
case when $m$ is less than 1, the probability of extinction is 1. A paradox
may only arise in case $m = 1$. Supposing that every man has just one son
on the average ($m = 1$), the probability of extinction is still 1 (except
the degenerate case $p_1 = 1$). Therefore, in spite of the fact that the
average number of male offspring remains unchanged over generations (it is
always 1) the extinction is unavoidable (more precisely its probability is
1), though one can show that the expected time passing till the extinction
is infinite.


c) The explanation of the paradox 
The equations

$$q = \lim_{n \to \infty} q_n = 1 \quad \text{and} \quad m = 1$$

do not contradict each other. The first equation means that the
probability of a male baby is nearly 0 in the nth generation, but if there
are some, then their number may be large, so the average can easily be 1.
d) Remarks


The Galton–Watson model is generally used in the special form when
$p_k = ab^{k-1}$ ($k = 1, 2, \ldots$) and $p_0 = 1 - p_1 - p_2 - \ldots$,
where $a$ and $b$ are positive numbers and $a$ is less than $1 - b$. In
this case $g_n(z)$ is a simple quotient of linear functions. In 1931
A. J. Lotka calculated the above values concerning the USA. He obtained
$a = 0.2126$, $b = 0.5893$ and $p_0 = 0.4825$, so the probability of the
extinction of a male line was $q = 0.819$. Nice old family names are
gradually becoming extinct and being replaced by more common dull ones
like Smith etc. Even the use of a combination of two or three names is not
always enough to avoid finding identical names, even in one office. The
following genetic-type naming would be challenging, fair and symmetrical
between the two sexes. Each child would inherit two family names, one from
the mother and one from the father. Since both parents would also have two
family names, they could select their less common (or more attractive)
names for the child. Besides these two family names they would, of course,
have the first name (or names) as well. Due to this method, our world
of names would become more colourful and characteristic.
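The recursion $q_{n+1} = g(q_n)$ with $q_0 = 0$ is itself a practical way of computing the extinction probability. The following sketch applies it to Lotka's values quoted above, for which $g(z) = p_0 + az/(1 - bz)$; the function names, the tolerance and the iteration limit are illustrative choices.

```python
def extinction_probability(g, tol=1e-12, max_iter=100_000):
    """Iterate q_{n+1} = g(q_n) from q_0 = 0 until successive values differ by less than tol."""
    q = 0.0
    for _ in range(max_iter):
        q_next = g(q)
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q   # not converged within max_iter (e.g. the critical case m = 1 converges slowly)


def lotka_g(z, a=0.2126, b=0.5893):
    """Generating function for p_k = a * b**(k-1), k >= 1, and p_0 = 1 - a / (1 - b)."""
    p0 = 1.0 - a / (1.0 - b)
    return p0 + a * z / (1.0 - b * z)


if __name__ == "__main__":
    print(extinction_probability(lotka_g))
```

Since the iteration starts from 0 and $g$ is increasing, the sequence increases to the smallest nonnegative root of $q = g(q)$, which is the root that matters when $m > 1$.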
2. MARKOV CHAINS AND A PHYSICAL PARADOX
a) The history of the paradox 


The concept of Markov chains is due to A. A. Markov, a Russian
mathematician, whose first paper on this topic was published in The Notes
of the Imperial Academy of Sciences of St. Petersburg in 1907. He
applied this new concept to the study of the statistical behaviour of the
letters in Onegin, the famous poem by Pushkin. The notion “Markov
chain” is the most important mathematical notion that originated (at
least partly) from linguistics. A sequence (chain) of discrete-valued
random variables $X_1, X_2, \ldots, X_t, \ldots$ is called a Markov chain
(by definition), if for any initial time $t$ the future (after-$t$)
“behaviour” of the sequence depends on the past (before-$t$) “behaviour”
only through the value $X_t$, i.e.,

$$P(X_{t+1} = i_{t+1} \mid X_t = i_t, X_{t-1} = i_{t-1}, \ldots) = P(X_{t+1} = i_{t+1} \mid X_t = i_t)$$

holds for all possible values $i_{t+1}, i_t, \ldots$ of the random
variables, that is, for every possible state. This type of sequence occurs
in many fields, e.g., in classical physics, where the future development of
a system is completely determined by its present state (e.g., by the
instantaneous velocity and position), independently of the way in which the
present state has developed. If $\{X_t\}$ is a Markov chain and the
conditional probabilities $P(X_{t+1} = i_{t+1} \mid X_t = i_t)$, the
transition probabilities, are independent of $t$, then the Markov chain is
called homogeneous. The transition probabilities of homogeneous Markov
chains can be arranged in a matrix $A = (p_{ij})$, where

$$p_{ij} = P(X_{t+1} = j \mid X_t = i).$$

The nth power of this transition matrix is $A^n = (p_{ij}^{(n)})$, where
$p_{ij}^{(n)} = P(X_{t+n} = j \mid X_t = i)$. This relation allows us to
utilize matrix theory in the theory of Markov chains. Nowadays Markov
chains (and their generalization for continuous time parameter and
continuous phase space, the Markov processes) are much more important for
natural and technical sciences than for linguistics, where they were
originally applied.
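The relation between the n-step transition probabilities and the powers of the transition matrix can be tried out on a small example. The two-state matrix below is hypothetical, and plain nested lists are used so that the sketch needs no external library.

```python
def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mat_pow(a, n):
    """n-th power of a square matrix; its (i, j) entry is the n-step transition probability."""
    size = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, a)
    return result


# Hypothetical homogeneous chain with two states; every row sums to 1.
A = [[0.9, 0.1],
     [0.5, 0.5]]

A2 = mat_pow(A, 2)   # two-step transition probabilities
print(A2)
```

For instance, the probability of being in the second state two steps after starting in the first is $0.9 \cdot 0.1 + 0.1 \cdot 0.5 = 0.14$, which is the corresponding entry of $A^2$.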

The problem of reversibility-irreversibility is an interesting paradox 
of classical mechanics and thermodynamics and Markov chains are 
efficient means of studying it. The essence of the problem is that the 
laws of classical mechanics are reversible, so they cannot explain why a 
cube of sugar dissolves in coffee and why we have never observed the 
reverse process. The second law of thermodynamics, however, (which 
was first formulated by L. S. Carnot) expresses the irreversibility of our 
world. (The first law of thermodynamics expresses the principle of
conservation of energy.) Forty years later R. Clausius introduced the
mathematical form of entropy, which is fundamental in the theory of
irreversible processes. (According to Clausius [Memoir read at the Philos.
Soc. Zürich, April 24. Pogg. Ann. 125:353, 1865], the word “entropy”
comes from the Greek τροπή, meaning “a turning” or “a turning point”.
Clausius also states that he added the “en” only to make the word
sound like “energy”, though the word ἐντροπή itself has a meaning,
namely “to turn one's head aside”.) By means of entropy the second law
of thermodynamics can be formulated as follows: in an isolated
system the entropy can never decrease; usually it tends to increase.
L. Boltzmann set out to verify this law using the kinematics of atoms
and molecules. (At that time Boltzmann's idea was not natural at all,
since many physicists doubted even the existence of atoms, e.g., M.
Faraday, E. Mach or W. F. Ostwald, who was the founder of energetics.)
Boltzmann was strongly influenced by Maxwell's work on the dynamic 
theory of gases. In the 1870s Boltzmann found the connection between 
entropy and thermodynamical probability (cf. Remark (i)). He showed 
that irreversibility does not contradict Newton's reversible mechanics: 
applying the latter to a large number of particles it necessarily leads 
to irreversibility, since systems consisting of millions of molecules tend 
toward a state of greater thermodynamical probability. This is the 
"final reason" for disintegration, amortization, aging (and moral or 
historical decay as some say). 

In 1907 P. and T. Ehrenfest created a model which elucidates the
paradox of reversibility-irreversibility with the help of Markov chains.


b) The paradox


Suppose we have a system of N molecules; each of them can be in one 
of two possible energy levels (states). If a molecule is in the first state, 
in one step it will get into the other state with probability p (and stays
in the first one with probability 1 − p); if it is in the second state, (in
one step) it moves to the first one with probability q (and remains in the
second state with probability 1 − q). As each molecule can “choose”
from two possible levels, the system of N molecules can be in $2^N$ different
states. If we consider the molecules indistinguishable, the system can be
only in N+1 different states: the state of the system is determined by
the number of molecules on the first level. Let $X_t$ denote the (random)
number of molecules being in the first state at time t. Then $X_1, X_2, X_3, \ldots$
is obviously a Markov chain, which describes the development of the
system. How can this model reconcile the reversibility of classical 
mechanics (symmetry in time) and the irreversibility of thermodynam- 
ics (asymmetry in time)? 


c) The explanation of the paradox 


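It can be shown that if $|1 - p - q| < 1$, then the limit of the generating
function of the distribution $P(X_t = j,\ X_{t+s} = k)$ is

$$\lim_{t \to \infty} E\left(z^{X_t} w^{X_{t+s}}\right) = \frac{\left[q\,(q + p\varrho^s)\,zw + pq\,(1 - \varrho^s)(z + w) + p\,(p + q\varrho^s)\right]^N}{(p+q)^{2N}}, \qquad \varrho = 1 - p - q,$$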


and this function is symmetric in z and w, therefore in equilibrium:

$$P(X_t = j,\ X_{t+s} = k) = P(X_t = k,\ X_{t+s} = j).$$


This equation expresses the symmetry between the past and future of 
the process (reversibility), whereas the following relation expresses irre- 
versibility : 


$$P(X_{t+s} = k \mid X_t = j) \ne P(X_{t+s} = j \mid X_t = k).$$

If, for example, $p = q = 1/2$, then

$$\lim_{t \to \infty} P(X_t = j) = \binom{N}{j} 2^{-N},$$


so the probability that the Markov chain approaches N/2 is greater than 
the probability that it moves away from N/2. 
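
Both properties can be checked numerically for a small system. The following is a minimal sketch for one time step (s = 1), with hypothetical values N = 6, p = 0.3, q = 0.2. It relies on a fact not spelled out above: since the molecules move independently, the equilibrium law of $X_t$ is binomial with parameter q/(p+q).

```python
from math import comb

def binom_pmf(n, r, x):
    """P(Bin(n, r) = x)."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * r**x * (1 - r)**(n - x)

def transition_matrix(N, p, q):
    """One-step matrix of X_t, the number of molecules in the first state.

    From j molecules in state 1, each stays with probability 1 - p and
    each of the N - j others arrives with probability q.
    """
    P = [[0.0] * (N + 1) for _ in range(N + 1)]
    for j in range(N + 1):
        for stay in range(j + 1):
            for arrive in range(N - j + 1):
                P[j][stay + arrive] += (binom_pmf(j, 1 - p, stay)
                                        * binom_pmf(N - j, q, arrive))
    return P

def stationary(N, p, q):
    """Equilibrium law: Binomial(N, q / (p + q))."""
    r = q / (p + q)
    return [binom_pmf(N, r, j) for j in range(N + 1)]

N, p, q = 6, 0.3, 0.2          # hypothetical parameters
P = transition_matrix(N, p, q)
pi = stationary(N, p, q)

# Reversibility: the joint law of (X_t, X_{t+1}) is symmetric in equilibrium.
max_joint_gap = max(abs(pi[j] * P[j][k] - pi[k] * P[k][j])
                    for j in range(N + 1) for k in range(N + 1))
# Irreversibility: the conditional probabilities are not symmetric.
toward_mean, away_from_mean = P[0][1], P[1][0]
```

The joint probabilities coincide up to rounding, while the step from 0 to 1 (towards the equilibrium mean 2.4) is far more likely than the step from 1 to 0.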


d) Remarks 


(i) Let $f(v, t)$ denote the distribution of the random velocities of gas
molecules at time t (for simplicity we assume that the distribution is
independent of the position of the gas molecules). Boltzmann formulated
his theorem in the following way in 1872: the derivative of the function

$$H(t) = \int f(v, t) \log_a f(v, t)\, dv, \qquad a > 1,$$

cannot be positive, that is, H cannot increase as t increases. (−H
corresponds to thermodynamical entropy, which, accordingly, may not decrease;
it usually increases.) In 1876 J. Loschmidt, an Austrian physicist,
raised the question of reversibility-irreversibility in the following form:
the laws of classical physics are invariant under the transformation
$t \to -t$ (they contain second derivatives with respect to t), whereas the
transformation $t \to -t$ turns Boltzmann's theorem into its opposite:
$H(-t)$ can never decrease. Through the analysis of this paradox it turned
out that for the proof of Boltzmann's theorem the perfect homogeneity
of molecular collisions has to be assumed, which is an exaggerated
idealization. Boltzmann's theorem is valid only statistically: the
probability that H(t) increases as time passes is very small.

Another paradox followed from the theorem of H. Poincaré. He 
showed that considering a closed and finite gas system, the phase point, 
which describes the state of this system (and moves on an equipotential 
surface of the multidimensional Euclidean space) returns to an arbitrary 
small neighbourhood of its initial position within finite time. But this— 
as E. Zermelo showed in 1896—contradicts Boltzmann's theorem: if a 
process is irreversible (its entropy increases), its phase point cannot be 
recurrent. The statistical formulation of Boltzmann's theorem, however, 
solves this problem, too: a sequence of events with very small probability
may lead to the return of the phase point, but, according to Boltzmann,
it takes 10" years, so this event is practically unobservable, whereas 
irreversibility can easily be observed. 

The Loschmidt—Zermelo paradoxes show that probability theory is a 
crucially important part of the foundation of molecular physics.
Hungarian physicists also gained distinction in this foundation. For example,
in 1926 Leo Szilárd laid to rest Maxwell's demon, which had held out the
possibility of a “perpetual motion” machine. (Maxwell stated that if the
increase of entropy is only statistical, a demon who can track the motion
of every molecule could make a perpetual motion machine. But, according
to Szilárd, such a “well-informed” demon requires great entropy,
therefore a perpetual motion machine directed by Maxwell's demon
cannot be realized.)

(ii) The statistical analysis of the text of Onegin was not an isolated
piece of research at all. At the end of the last century it became fashionable
to examine the frequency distribution of words in different texts (to help
language teaching and shorthand writing). The first frequency dictionary
was published in 1898 by F. W. Kaeding (Häufigkeitswörterbuch der
deutschen Sprache), and it was based on a text consisting of 11 million
words. The application of mathematical statistics in linguistics, however,
has become a separate science owing especially to the American scientist
G. K. Zipf (1902–1950). His book “Human Behavior and the Principle
of Least Effort” expounds a very complex idea.


3. THE PARADOX OF BROWNIAN MOTION


a) The history of the paradox 


While performing microscopic experiments Robert Brown, the English 
botanist (1773—1858), discovered not only the nucleus of the cell but 
also another interesting though at that time unexplainable phenomenon: 
the random motion of colloid-size particles, known today as Brownian 
motion. Since he made his first experiments (June—August 1827) with 
pollen, it was supposed that the motion was biological. Brown's great 
merit was the experimental proof of the sole physical nature of the phe- 
nomenon. At that time microphysics was not developed enough to be 
able to explain the phenomenon. No wonder that even in 1879 C. W. 
Nägeli, a Swiss—German biologist, refused to believe that the Brownian 
motion was due to the thermal diffusion of molecules. On the other hand, 
J. H. Poincaré claimed in a lecture (Paris, 1904) that when big particles 
about the size of 0.1 mm are hit very many times from all directions by 
moving atoms, they do not move because the random collisions neutral- 
ize each other, according to the laws of large numbers; smaller particles, 
however, are not hit often enough for the collisions to neutralize each other, so the
particles move in a zigzag path. The quantitative explanation was given
independently by Einstein and the Polish physicist Smoluchowski in 1905. According
to Einstein's theorem, the mean path of the particles is proportional to
the square root of the time t. Consequently, their mean speed is
proportional to $1/\sqrt{t}$. From this it follows that the instantaneous speed of the
particles would be infinite at any moment, showing that there are
problems in defining the instantaneous speed for Brownian motion.
A deeper mathematical analysis was required to solve this problem. It
was not performed until more than a decade later by N. Wiener. To
acknowledge his merits in this field, the mathematical model of Brownian
motion is named after him: the Wiener process. The Wiener process is a
motion with continuous path (its realizations are continuous) which is
nowhere differentiable with probability 1. This means that the
instantaneous speed cannot be defined anywhere.
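
The square-root law can be illustrated on a discrete stand-in for Brownian motion. The sketch below assumes a symmetric walk with steps ±1 (a simplification, not Einstein's model itself) and computes the root-mean-square displacement exactly from the binomial distribution.

```python
from math import comb, sqrt

def rms_displacement(n):
    """Exact root-mean-square position of a symmetric +-1 walk after n steps."""
    second_moment = sum(comb(n, k) * (2 * k - n) ** 2
                        for k in range(n + 1)) / 2 ** n
    return sqrt(second_moment)

def mean_speed(n):
    """RMS displacement divided by elapsed time; behaves like 1/sqrt(n)."""
    return rms_displacement(n) / n

speeds = {n: mean_speed(n) for n in (1, 4, 16, 64, 256)}
```

Quadrupling the observation time halves the mean speed, so over shorter and shorter time intervals the apparent speed grows without bound.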

Functions which are everywhere continuous but nowhere differentiable
were already known to mathematicians long before Wiener. Such
pathological functions were, however, considered mere curiosities. In 1806
A. M. Ampère, the famous physicist, even wanted to show that, apart
from some isolated points, every continuous function is differentiable.
Owing to research mainly on Fourier series, the notion of function was
made much more general by B. Bolzano (1834), G. F. B. Riemann (1854)
and K. Weierstrass (1872). Weierstrass' continuous but nowhere
differentiable function was first published by P. du Bois-Reymond in
1875. Most outstanding mathematicians were not too enthusiastic
about this invention. According to Poincaré (Science et Méthode, 1909),
“In the old days when people invented a new function they had some
useful purpose in mind: now they invent them deliberately just to
invalidate our ancestors' reasoning, and that is all they are ever going
to get out of them.” Ch. Hermite wrote to T. J. Stieltjes in a similar way:
“With horror and dread do I turn away from this miserable plague:
functions that have no derivatives.” The Wiener process was an obvious
refutation of the above accusations, because nobody could say that
Brownian motion was invented only to create a pathological
counterexample. Twentieth-century research also made it clear that among continuous
functions it is precisely the nondifferentiable ones that are typical: in a sense, they are
in the overwhelming majority (Oxtoby, J. C., Measure and Category,
Springer, New York, 1971). In practice, however, most continuous
functions we meet are differentiable. It is just like the case of irrational numbers.
In spite of their majority among real numbers (a random number is
irrational with probability 1), in practice we generally use rational numbers.


b) The paradox


The trajectories (realizations) of Brownian motion are rather irregular 
(i.e., they are nowhere differentiable). In the usual sense we consider
any irregular curve, such as the trajectory of planar Brownian motion, 
one dimensional. At the same time it can be shown that the trajectory 
of a planar Brownian motion actually fills the whole plane (each point 
of the plane is approached with any given accuracy with probability 1). 
Therefore the trajectories can also be considered as two-dimensional 
curves. Which conception is preferable? 


c) The explanation of the paradox 


The notion of dimension was used in the common, everyday sense even
at the beginning of this century. Curves, surfaces, and bodies were
considered one-, two-, and three-dimensional, respectively. Generally,
a figure is said to be k-dimensional if k parameters (coordinates) are
required to “characterize” its points. Using Poincaré's intuitive
ideas L. E. J. Brouwer defined the topological dimension in 1913. Later,
in 1922, K. Menger and P. S. Uryson, working independently, also
succeeded in defining it. (For more details see the book by Hurewicz and
Wallman.) By the definition of topological dimension, Brownian
motion is one-dimensional. On the other hand, in 1919 F. Hausdorff
introduced the following notion of dimension, according to which
Brownian motion is two-dimensional. In the d-dimensional
Euclidean space the volume of the unit sphere is $v(d) = \Gamma(1/2)^d / \Gamma(1 + d/2)$,
where $\Gamma$ denotes the usual gamma function (see the Notations). This
expression makes sense even if $d \ge 0$ is not an integer. Let a set E be given
in the n-dimensional Euclidean space, which is covered by a finite number
of n-dimensional spheres with radii $r_1, r_2, \ldots$. The Hausdorff d-measure
of the set E is then

$$\lim_{r \to 0} \inf \sum_{r_i < r} v(d)\, r_i^d.$$

A. S. Besicovitch has proved that there always exists a (real) number D
such that for $d < D$ the d-measure of the set E is infinite, but for $d > D$ it
is 0. This number D is called the Hausdorff or Hausdorff-Besicovitch
dimension of the set E. In this sense the value of the dimension need not
be an integer. E.g., both coordinates of the planar Brownian motion
as functions of time (i.e., the curves of the “one-dimensional Brownian
motion”) have Hausdorff dimension 3/2. These curves are therefore
somewhere between being a “real” curve and a “real” surface. The
dimension of the curve of the planar Brownian motion is 2, just like that
of the “real” surfaces.
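
The normalizing constant v(d) is easy to evaluate, also for non-integer d. The following sketch simply evaluates the gamma-function expression for the volume of the unit sphere and compares it with the familiar values for d = 1, 2, 3; the function name is chosen here for illustration.

```python
from math import gamma, pi

def unit_ball_volume(d):
    """v(d) = Gamma(1/2)^d / Gamma(1 + d/2), defined for any real d >= 0."""
    return gamma(0.5) ** d / gamma(1 + d / 2)

v1 = unit_ball_volume(1)        # length of the interval [-1, 1]
v2 = unit_ball_volume(2)        # area of the unit disc
v3 = unit_ball_volume(3)        # volume of the unit ball
v_frac = unit_ball_volume(1.5)  # the non-integer case d = 3/2
```

For d = 3/2, the Hausdorff dimension of the graph of one-dimensional Brownian motion, the constant lies between the values for the line and for the plane.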


d) Remarks 


(i) In the last few years many papers have been published on figures
whose topological and Hausdorff dimensions are different. B. Mandelbrot
called them fractals. Fractals, e.g., Wiener processes, play a fundamental
role in describing irregular figures of nature. While the Euclidean line
is the most frequent “letter” describing regular forms of nature, for
irregular forms (clouds, seashores) it is the Wiener process. In fact
neither “real” lines (having extension in length only) nor “real” Wiener
processes (nowhere differentiable) exist in nature, but with their help a
fairly good picture of “real” forms can be obtained. Fractals have also
put the famous Olbers' paradox of astronomy in a new light. According
to the paradox, it is inconceivable that the sky does not shine uniformly
at night if the stars are uniformly distributed in space. (See Mandelbrot's
book.)

(ii) In his book Mandelbrot also mentions other notions of dimension,
such as the Fourier dimension. As for algebraic dimension see
Székely's article.

(iii) The irregularity of the Wiener process led to the development of
a new frontier of probability theory and analysis, namely the theory of
stochastic differential equations. This theory deviates greatly
from the usual differential and integral calculus. E.g., if f(t) is a
differentiable function, then

$$\int_0^1 f(t)\, df(t) = \frac{f^2(1) - f^2(0)}{2}.$$

In the theory of stochastic integrals the above expression makes sense
even if f(t) is the (nowhere differentiable) Wiener process. In this case
the result of the integral is less than in the above (differentiable) case.
The difference is exactly 1/2.
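
The correction term 1/2 can be seen already in a discrete approximation. The sketch below makes two assumptions that go beyond the text: the Wiener process on [0, 1] is replaced by a random walk with n steps of size $\pm 1/\sqrt{n}$, and the integral is taken in the Itô sense, i.e., the integrand is evaluated at the left endpoint of each small interval. The seed and the number of steps are arbitrary.

```python
import random
from math import sqrt

def ito_sum(steps):
    """Left-endpoint sum  sum_i W_i (W_{i+1} - W_i)  along a discrete path.

    `steps` are the increments of W; the path starts at W_0 = 0.
    Returns (ito_sum, W_end).
    """
    w, total = 0.0, 0.0
    for dw in steps:
        total += w * dw      # integrand evaluated before the increment
        w += dw
    return total, w

rng = random.Random(0)       # fixed seed, chosen arbitrarily
n = 10_000
steps = [rng.choice((-1.0, 1.0)) / sqrt(n) for _ in range(n)]

integral, w_end = ito_sum(steps)
classical = w_end ** 2 / 2           # what ordinary calculus would give
difference = classical - integral    # equals half the sum of squared increments
```

Whatever path is drawn, `difference` equals half the sum of the squared increments, which is 1/2 here; for a differentiable function this sum would vanish in the limit.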


4. THE PARADOX OF WAITING TIMES (DO BUSES RUN
MORE FREQUENTLY IN THE OPPOSITE DIRECTION?)


a) The history of the paradox 


Though modern technology continually shortens wasted waiting times, 
our everyday nervousness is in great part due to useless waits. So the
efforts of mathematicians and engineers to reduce waiting times are
followed with great interest. It was A. K. Erlang who examined waiting
time problems for telephone exchanges (cf. 1/6 Remark (iii)). In the
1930s W. Feller introduced the notion of birth and death processes,
which gave an impulse to the mathematical analysis of waiting and
greatly contributed to the emerging theory of operations research. The
study of queuing systems has become an independent branch of science
on the borderland between probability theory and operations research.


b) The paradox


The "frequency of bus runs", i.e., average time passing between the 
arrival of two consecutive buses is usually indicated at the bus stops. 
Suppose the frequency of runs at a certain bus stop is 10 minutes. Then 
we expect that people have to wait 5 minutes on the average for a bus. 
It was found, however, that the average waiting time may not only 
exceed 5 minutes, it may even be infinite! (Experience shows, however,
that the situation in everyday life is not so bad.)


c) The explanation of the paradox 


If buses arrive at the bus stop not only on an average but exactly every 
10 minutes, the average waiting time would really be 5 minutes. But 
buses actually run in “packs” (except if they are near the terminal where 
they started from). Therefore waiting times show a large dispersion about 
the average value. Let us suppose that the time intervals between
consecutive arrivals of buses at a bus stop are independent identically
distributed random variables with expectation m and standard deviation s.
Then it can be shown that the average waiting time is $T = (m^2 + s^2)/2m$.
Let F(t) be the distribution function and f(t) be the density function of
the intervals between consecutive arrivals of buses. (We assume now that
the density function exists, though this condition may be omitted at
the cost of some modifications.) Suppose that time t is measured from the
arrival of the last bus before our arrival. Then the density function of the
random time interval till the arrival of the next bus is not f(t) but another
function which is proportional to $t \cdot f(t)$ (i.e., $t \cdot f(t)/m$), since the
probability that we arrive during a certain time interval is proportional to
its length t. On average we arrive in the middle of such an interval, thus
the average waiting time is

$$T = \frac{1}{2m} \int_0^\infty t^2 f(t)\, dt = \frac{m^2 + s^2}{2m}.$$

(The density function of our waiting time is $(1 - F(t))/m$.) Accordingly,
$T = m/2$ only if $s = 0$, but if $s = \infty$ then $T = \infty$, too. These extremes are
naturally far from reality. Buses actually arrive at intervals which have
nearly exponential (“ageless”) distribution with some parameter $\lambda$.
Then $m = s = 1/\lambda$, that is, $T = m$, meaning that if the frequency of runs
is 10 minutes, the average waiting time is also 10 minutes and not 5.

The heuristic explanation of this paradox is very simple. If somebody 
arrives at the bus stop at random, he has a much greater chance of 
waiting for a long time, since his waiting time would only be short if 
he caught one of the buses in a “pack”; but the buses of a “pack” 
arrive at very short intervals so one has not much chance of catching 
any of them. Consequently, if the time intervals between consecutive 
buses show great dispersion, there are only a few people who have a 
short wait and there are many people who have a long wait, meaning 
that the average waiting time T is large. 
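
A few numbers make the effect of the dispersion visible. The sketch below evaluates $T = (m^2 + s^2)/2m$ for a mean gap of 10 minutes; the uniform case and the “pack” of gaps are hypothetical illustrations, and the second function is the empirical counterpart of the same formula for a list of observed gaps.

```python
def mean_wait(m, s):
    """Average waiting time T = (m^2 + s^2) / (2 m)."""
    return (m * m + s * s) / (2 * m)

m = 10.0                                  # minutes between buses on average
punctual = mean_wait(m, 0.0)              # exactly every 10 minutes
exponential = mean_wait(m, m)             # exponential gaps: s = m
uniform = mean_wait(m, m / 3 ** 0.5)      # gaps uniform on (0, 2m): s = m / sqrt(3)

def mean_wait_from_gaps(gaps):
    """Same quantity from observed gaps: sum(g^2) / (2 * sum(g))."""
    return sum(g * g for g in gaps) / (2 * sum(gaps))

# A hypothetical "pack": three buses close together, then a long gap.
packed = mean_wait_from_gaps([1.0, 1.0, 1.0, 37.0])
```

All four timetables have the same mean gap of 10 minutes, yet the average wait ranges from 5 minutes (punctual buses) to more than 17 minutes (the pack).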


d) Remarks 


(i) We often have the illusion that no matter which way we would
like to go, buses and trams run more frequently in the opposite
direction. Naturally this is impossible in reality. The explanation is very
simple. We see only one bus (the one we take) which runs in the same
direction as we want to go, whereas the probability of two or three buses
passing in the opposite direction while we are waiting is positive. Their
expected number is

$$\frac{m^2 + s^2}{2m^2} \ge \frac{1}{2},$$

which is really greater than 1/2 if s is positive. This seems to show an asymmetry
between the two directions. But in fact this is not the case. The symmetry
between the two directions can be expressed by the fact that the
probability that no bus will go in the opposite direction while we are waiting
for our bus is just 1/2 (but if a bus goes in the opposite direction then
more than one may also go, so the expectation may be arbitrarily large).
Let $p_k$ denote the probability that exactly k buses pass in the opposite
direction while we are waiting. If the intervals between the arrivals of
consecutive buses are exponentially distributed, then

$$p_k = \left(\frac{1}{2}\right)^{k+1};$$

if they are uniformly distributed on the interval (0, 1), then

$$p_k = 4\left[\frac{1}{(k+2)!} - \frac{2}{(k+3)!} + \frac{1}{(k+4)!}\right],$$

where k = 1, 2, ... ($p_0$ is always equal to 1/2 as we have mentioned).
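
These distributions can be checked against the expected number $(m^2 + s^2)/2m^2$, which is 1 for exponential intervals and 2/3 for intervals uniform on (0, 1). The sketch below takes the exponential case as $p_k = (1/2)^{k+1}$ and, for the uniform case, assumes the reading $p_k = 4[1/(k+2)! - 2/(k+3)! + 1/(k+4)!]$ for k ≥ 1; the truncation at 60 terms is an arbitrary choice.

```python
from fractions import Fraction
from math import factorial

def p_exponential(k):
    """P(exactly k opposite buses) for exponential gaps: (1/2)^(k+1)."""
    return Fraction(1, 2 ** (k + 1))

def p_uniform(k):
    """Same probability for gaps uniform on (0, 1), k >= 1 (assumed reading)."""
    return 4 * (Fraction(1, factorial(k + 2))
                - Fraction(2, factorial(k + 3))
                + Fraction(1, factorial(k + 4)))

def tail_sum(p, first, terms=60):
    """Sum of p(k) for k = first, ..., first + terms - 1."""
    return sum(p(k) for k in range(first, first + terms))

def mean_count(p, terms=60):
    """Truncated expectation: sum of k * p(k) for k = 1, ..., terms."""
    return sum(k * p(k) for k in range(1, terms + 1))

uniform_total = Fraction(1, 2) + tail_sum(p_uniform, 1)   # p_0 = 1/2 included
uniform_mean = mean_count(p_uniform)
exponential_mean = mean_count(p_exponential)
```

Up to the negligible truncation error the probabilities add up to 1 and the two means come out as 2/3 and 1, in agreement with the general formula.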

(ii) Where there is a congestion of bus traffic, the bus service would
become steadier if the congested buses waited for a longer time at a bus
stop; thus the average waiting time would also shorten. (Actually I have
never seen buses waiting at a bus stop just to make the traffic steadier,
though lifts are sometimes held back to wait for people who are likely
to arrive soon. So lift traffic is slowed down to shorten the average
waiting times!) Let $t_1, t_2, t_3, \ldots$ denote the times when buses arrive at a
certain bus stop, and let $X_1 = t_1$, $X_i = t_i - t_{i-1}$ (i = 2, 3, ...). If the
distribution function of $X_2, X_3, \ldots$ is F, then, as we have already
mentioned, the density function of $X_1$ is $[1 - F(t)]/m$, whose
expectation is $T = (m^2 + s^2)/2m$. Slowing down the traffic means that we increase
the $X_i$'s to $X_i + g(X_i)$ (i = 2, 3, ...) by a non-negative function g. It
can be shown that among integrable functions the function
$g(x) = \max(0, c - x)$ most shortens waiting times, where c is the unique
solution of the following equation:

$$c\,E(X) + \int_0^c (c - x) F(x)\, dx = \frac{E(X^2)}{2}$$

(where X is a random variable having the same distribution as $X_2, X_3, \ldots$).

If for example X is exponentially distributed, more precisely if
$F(x) = 1 - e^{-x}$ (x > 0), then both the expectation and the variance of the
waiting time equal 1. If we choose the optimal $g(x) = \max(0, 0.901 - x)$
(accurate to three decimal places) then the expected value of our average
waiting time is only 0.901 (and the variance is 0.691).
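
The value 0.901 can be recomputed. The sketch below assumes that the equation for c reads $c\,E(X) + \int_0^c (c - x) F(x)\,dx = E(X^2)/2$; for $F(x) = 1 - e^{-x}$ this reduces to $c^2/2 = e^{-c}$, which is solved by bisection. The mean waiting time is then evaluated as $E(Y^2)/2E(Y)$ for the lengthened interval $Y = \max(X, c)$. The function names are illustrative only.

```python
from math import exp

def threshold_equation(c):
    """Left side minus right side of the equation for c when
    F(x) = 1 - exp(-x); it reduces to c^2/2 - exp(-c)."""
    return c * c / 2 - exp(-c)

def bisect(f, lo, hi, tol=1e-12):
    """Root of f on [lo, hi], assuming f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mean_wait_with_hold(c):
    """E(Y^2) / (2 E(Y)) for Y = max(X, c), X exponential with mean 1."""
    mean_y = c + exp(-c)
    second_moment_y = c * c + 2 * (c + 1) * exp(-c)
    return second_moment_y / (2 * mean_y)

c_opt = bisect(threshold_equation, 0.0, 2.0)
wait_opt = mean_wait_with_hold(c_opt)
wait_no_hold = mean_wait_with_hold(0.0)
```

At the optimum the mean waiting time coincides with the threshold c itself, so holding back the early buses reduces the average wait from 1 to about 0.901.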

(iii) The following paradox is connected with traffic, too. (G. Schay
drew my attention to the problem itself, after my talk at MIT in 1983.)
The paradox states that it is not true that the faster cars go the more of
them can get through the green light, as, at higher speeds, cars have
to keep greater distances. Let us start from the following model to
calculate the optimal speed. Suppose cars go at the same speed v; let $X_i$
denote the time (depending on chance) between the ith and the (i+1)th
car getting into the traffic; the $X_i$'s are independent and identically
distributed (for simplicity we shall assume that this distribution is exponential
with a parameter $\lambda > 0$). Cars arrive at the traffic lights at intervals
$Y_1, Y_2, \ldots$ which are not simply equal to $X_1, X_2, \ldots$, since cars have to
follow each other at a certain distance. Let $l_i$ denote the length of the
ith car and $a_i$ its braking deceleration; $l_i$ and $a_i$ are usually not
independent, but we assume that the vectors $(l_i, a_i)$, i = 1, 2, ..., are independent
and identically distributed. The braking distance is $v^2/2a_i$, thus cars
have to follow each other at the distance $l_i + v^2/2a_i$. The time between
the arrival of the first and (n+1)th car is

$$\sum_{i=1}^{n} Y_i = \max\left\{ \sum_{i=1}^{n} X_i,\ \sum_{i=1}^{n-1} Y_i + Z_n \right\},$$

where

$$Z_i = \left(l_i + \frac{v^2}{2a_i}\right)\frac{1}{v}.$$

If

$$M(t) = \max\left(n : \sum_{i=1}^{n} Y_i \le t\right),$$

then the number of cars which can get through the green light from time
t to t+h is M(t+h) − M(t). It is known (from the theory of queues) that

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} Y_i = \begin{cases} E(Z) & \text{if } E(X) \le E(Z) \text{ (this corresponds to the traffic in rush hours)}, \\ E(X) & \text{if } E(X) > E(Z). \end{cases}$$

Let t be a random time within the interval [0, T]. Then the average
number of cars which go through the green light is

$$\frac{1}{T}\int_0^T [M(t+h) - M(t)]\, dt = \left[\int_T^{T+h} M(t)\, dt - \int_0^h M(t)\, dt\right] T^{-1} = h \min\left(E(Z)^{-1}, E(X)^{-1}\right) + o(1), \quad \text{if } T \to \infty$$

(o(1) denotes a quantity which converges to 0 as $T \to \infty$). Therefore we
seek the maximum of

$$\min\left(\lambda,\ [E(l)/v + E(a^{-1})\, v/2]^{-1}\right)$$

for v. The second term is maximal if

$$v = \sqrt{2E(l)/E(a^{-1})}.$$
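
A small numerical illustration shows that both slower and faster traffic let fewer cars through. The sketch evaluates the quantity $\min(\lambda, [E(l)/v + E(a^{-1})v/2]^{-1})$ and the optimal speed $v = \sqrt{2E(l)/E(a^{-1})}$; the car length, the braking figure and the arrival rate are hypothetical values chosen only for the example.

```python
from math import sqrt

def flow_rate(v, lam, mean_length, mean_inv_decel):
    """Cars per second: min(lambda, 1 / (E(l)/v + E(1/a) v / 2))."""
    mean_headway = mean_length / v + mean_inv_decel * v / 2
    return min(lam, 1.0 / mean_headway)

def optimal_speed(mean_length, mean_inv_decel):
    """Speed maximizing the second term: v = sqrt(2 E(l) / E(1/a))."""
    return sqrt(2 * mean_length / mean_inv_decel)

# Hypothetical figures: cars 5 m long on average, E(1/a) = 0.2 s^2/m
# (deceleration around 5 m/s^2), one arrival per second.
E_l, E_inv_a, lam = 5.0, 0.2, 1.0
v_star = optimal_speed(E_l, E_inv_a)
best = flow_rate(v_star, lam, E_l, E_inv_a)
slower = flow_rate(0.5 * v_star, lam, E_l, E_inv_a)
faster = flow_rate(2.0 * v_star, lam, E_l, E_inv_a)
```

With these figures the optimal speed is about 7 m/s; halving or doubling it reduces the flow by the same factor, since the two terms of the mean headway simply exchange their roles.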


5. THE PARADOX OF RANDOM WALKS


a) The history of the paradox


About 60 years ago George Polya, the American mathematician of Hun- 
garian origin, used to walk in a park where he kept meeting the same 
couple. At that time he did not realize how accidental these random 
meetings were, i.e., how small the probability was. Shortly afterwards 
he calculated the probability of meetings in a model where 2 persons 
are walking randomly on a squared network independently of each 
other (at each crossing the probability of choosing any of the four pos- 
sible directions is the same). Polya found that the probability of meeting 
was 1. (Consequently if their time were unlimited they could also meet 
infinitely many times with probability 1.) In the case of a cubic network, 
however, the probability of a meeting is strictly less than 1 (so the proba- 
bility of infinitely many meetings is now 0). From this interesting dis- 
covery a brand new branch of probability theory has developed during 
the last 60 years. In 1964 a nice monograph was written on this theme 
by F. Spitzer. 


b) The paradox 


From Polya's theorem it follows that considering a random walk on the 
integer points of the real line starting from the origin and moving at 
every step by 1 either to the left or to the right with the same probability 
1/2 (independently of the previous steps), we shall get back to 0 with 
probability 1. Now the question arises: before returning to 0 (for
the first time), how many times does the walk reach a fixed integer k?
It is natural to suppose that the greater |k| is, i.e., the farther the point is
from the origin, the fewer times this will happen, on average.
Surprisingly, the random walk reaches k before the first
return just as many times on average, namely once, however great
|k| is.


c) The explanation of the paradox 


The paradox can be explained very simply. The average number of steps 
necessary to return to the origin (i.e., the expected value of the recurrence 
time) is infinite, consequently there is enough time to reach any point 
on the line once, on average. A related paradox is the following. The
starting point of the random walk is its most visited site only finitely
many times (with probability one).
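
The value 1 can also be obtained by a short exact calculation. The sketch below is an illustration based on the classical gambler's ruin result, which is not derived in the text: a symmetric walk started at 1 reaches k before 0 with probability 1/k. A distant point k is therefore rarely reached before the first return, but once it is reached it is visited many times.

```python
from fractions import Fraction

def visit_statistics(k):
    """Return (P(reach k), mean visits given k is reached, expected visits)
    before the first return to 0, for an integer k >= 1."""
    # The first step must go right (prob. 1/2); then k is reached
    # before 0 with probability 1/k.
    reach = Fraction(1, 2) * Fraction(1, k)
    # From k the walk fails to come back to k only if it steps left
    # (prob. 1/2) and then reaches 0 before k (prob. 1/k).
    escape = Fraction(1, 2) * Fraction(1, k)
    # The number of visits, given that k is reached, is geometric.
    mean_if_reached = 1 / escape
    return reach, mean_if_reached, reach * mean_if_reached

table = {k: visit_statistics(k) for k in (1, 2, 10, 1000)}
```

For k = 1000 the point is reached with probability 1/2000 only, but then it is visited 2000 times on average, so the expected number of visits is again 1.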


d) Remarks 


(i) Under the above conditions suppose that each step to the right
has length 2 but each step to the left has length 1 only. In this case the random walk is not
symmetric and one can easily see that starting from 0 the probability
of reaching −1 is less than 1. This probability is, surprisingly, just
$(\sqrt{5} - 1)/2$, i.e., the ratio of the golden section.
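
The golden section appears as the solution of a simple equation. The sketch below assumes that a step is +2 or −1 with probability 1/2 each. If r denotes the probability of ever getting one unit lower, then after a step of +2 the walk has to descend three units, so $r = 1/2 + r^3/2$; iterating from r = 0 converges to the smallest root, which is the required probability.

```python
from math import sqrt

def prob_reach_minus_one(iterations=200):
    """Smallest root in [0, 1] of r = 1/2 + r^3 / 2, by fixed-point iteration.

    A step is -1 or +2 with probability 1/2 each. After a +2 step the
    walk must descend three levels, which has probability r^3.
    """
    r = 0.0
    for _ in range(iterations):
        r = 0.5 + 0.5 * r ** 3
    return r

r = prob_reach_minus_one()
golden = (sqrt(5) - 1) / 2
```

The equation factorizes as $(r - 1)(r^2 + r - 1) = 0$, and the iteration settles on the root of the second factor, about 0.618, rather than on r = 1.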

(ii) Diffusion-type random walks where the probability of moving
to the left or to the right depends on the actual location (k) are very important
in practice. Let $p_k$ denote the probability of moving to the right and
$1 - p_k$ the probability of moving to the left. Suppose furthermore that

$$p_k = \frac{1}{2}\left(1 + \frac{c}{k}\right)$$

(at least for great values of |k|) where c is an arbitrary constant. This
kind of random walk returns to 0 with probability 1 (i.e., it is recurrent)
if $c \le 1/2$. In the case $c < -1/2$, the expected value of the recurrence
time is finite, therefore the paradoxical situation of the random walk
(corresponding to c = 0) cannot appear.

(iii) Research on random walks can be extended from squared networks
to more general ones called graphs. These generalizations have
interesting applications in the theory of electric networks. See the
fundamental paper by C. Nash-Williams written in 1959. Many other
applications (in physics, chemistry and biology) are discussed in Weiss'
paper.


6. THE PARADOX OF STOCK EXCHANGE;
MARTINGALES


a) The history of the paradox 


The mathematical study of the Stock Exchange is of almost the same age 
as the Stock Exchange itself. Presumably not even Gresham's Exchange 
in the 16th century was free from mathematical speculation, but the 
basic methods of probability theory were not applied in this field for 
quite a long time. It is typical that even in 1900, when Louis Bachelier 
defended his doctoral thesis in Paris on the connection between price 
fluctuations in the Stock Exchange and Brownian motion (preceding 
the physicists' investigations concerning Brownian motion), the
committee hardly appreciated his essentially new ideas. Bachelier created
the general mathematical model of a fair game, the so-called martingale,
which later became one of the most important stochastic processes after
the research of J. Ville, P. Lévy, J. L. Doob and others. A sequence
of random variables $X_1, X_2, X_3, \ldots$ is called a martingale if the
conditional expectation of the difference $X_{n+1} - X_n$ (“the profit gained at time n”),
given the total capitals $X_n, X_{n-1}, \ldots$, is zero with probability one, for
every n, that is,

$$E(X_{n+1} - X_n \mid X_n, X_{n-1}, \ldots) = 0$$

with probability one. The sequence $X_1, X_2, X_3, \ldots$ is a supermartingale
(or submartingale) if the above mentioned expectation is not positive
(or not negative) with probability one. The martingale is a general model of a
fair game, of “quantitative justice”, which can be applied in many
fields, for instance, in the study of paradoxes in the Stock Exchange.


b) The paradox 


If a share is expected to be profitable, it seems natural that the share is 
worth buying, and if it is not profitable, it is worth selling. It also seems 
natural to spend all one's money on shares which are expected to be 
the most profitable ones. Though this is true, in practice other strategies 
are followed, because while the expected value of our money may increase 
(our expected total capital tends to infinity), our fortune itself tends to 
zero with probability one. So in Stock Exchange business we have to 
be careful: shares which are expected to be profitable are sometimes worth 
selling. 


c) The explanation of the paradox 


Let us suppose that we would like to buy shares, and we can choose from
k different ones; in a one-year period the ith share (i = 1, 2, ..., k) yields
$X^{(i)}$ times as much profit as our initial capital was at the beginning of
the year. (Obviously $X^{(i)} \ge -1$.) Suppose, for simplicity, that $X^{(i)}$ is bounded,
though this condition can be omitted after some modifications of the
reasoning below. The random vector $X = (X^{(1)}, \ldots, X^{(k)})$ describes the
quotations. We assume that the vectors $X_j$ (j = 1, 2, ...), which describe
the quotations in the jth year, have the same distribution as X and
are independent. Let $T_0$ be our initial capital and let $a_j^{(i)}$ denote the
proportion of our total capital that we spend on buying shares of type i
in the jth year. The quantity $a_j^{(i)} \ge 0$ may depend on the random vectors
$X_1, X_2, \ldots, X_{j-1}$. The vector $a_j = (a_j^{(1)}, a_j^{(2)}, \ldots, a_j^{(k)})$ describes our buying
strategy in the jth year. Evidently

$$\sum_{i=1}^{k} a_j^{(i)} \le 1.$$

Let $a_j X_j$ denote the following sum:

$$\sum_{i=1}^{k} a_j^{(i)} X_j^{(i)}.$$

Using this notation, our total capital at the end of the nth year is

$$T_n = T_0 \prod_{j=1}^{n} (1 + a_j X_j).$$

Obviously, the expectation of $T_n$ is the greatest if we spend all our money
on the most profitable shares every year. (We suppose that at least one
of the shares is profitable.) In this case the expected value of $T_n$ tends
to infinity (so we expect to grow rich) and still our total capital
$T_n$ may tend to zero with probability one! Let us examine this paradoxical
situation in detail. Clearly,


$$\log T_n - \log T_0 = \sum_{j=1}^{n} \log(1 + a_j X_j).$$

Assuming that $a_j = a$ is a constant vector (independent of j; this
assumption is quite natural since in our case the quotation distributions do
not change), the right side of the equation is the greatest (with
probability one for large n, according to the law of large numbers) if

$$E(\log(1 + a_j X_j))$$

is maximal (under the conditions $a^{(i)} \ge 0$ and $\sum_{i=1}^{k} a^{(i)} \le 1$). Let $a^*$
denote the strategy which maximizes the above quantity. Further
let $T_n^*$ and $T_n$ denote, respectively, our total capital if we follow the
strategy $a^*$ or an arbitrary strategy a. Then it can be shown that the
sequence $T_n/T_n^*$ (n = 1, 2, ...) is always a non-negative supermartingale
(moreover, if every coordinate $a^{*(i)}$ of the vector $a^*$ is positive and

$$\sum_{i=1}^{k} a^{*(i)} < 1,$$

then it is a martingale). Therefore, according to a well-known theorem
of martingale theory

$$\lim_{n \to \infty} T_n/T_n^* = T$$

always exists with probability 1 and its expectation is at most 1. Thus
$a^*$ is an optimal strategy (in this sense) in the long run. In sum: it is
advantageous to maximize the expectation of $\log T_n$ and not that of
$T_n$. The heuristic explanation of this fact is quite simple: $T_n$ increases
exponentially for every reasonable strategy, and the rate of this increase
can be maximized just by maximizing the expected value of $\log T_n$.

Consider now a simple (but extreme) example. Suppose we can choose
from two kinds of shares; $p_{11} = 10\%$ is the chance that the values of
both shares double, $p_{00} = 5\%$ is the probability that both shares lose
their value, the probability that the value of the first share doubles and
the second loses its value is $p_{10} = 50\%$, and with probability $p_{01} = 35\%$
the reverse happens. Then the first share is profitable (with a
probability of 60%) and the second one is losing (it is profitable only
with a probability of 45%), but it is still reasonable to buy some of both
kinds of shares, more exactly to spend one third of our money on the
shares every year in the ratio of 13:4. Generally it is reasonable to spend
the

$$\frac{p_{11} - p_{00}}{p_{11} + p_{00}}$$

proportion of our money on the two kinds of shares in the ratio of

$$(p_{11} p_{10} - p_{01} p_{00}) : (p_{11} p_{01} - p_{10} p_{00})$$

(assuming that the differences are positive). Though the problems which
occur in the practice of Stock Exchange business are much more
complicated than the preceding example, the paradox in question also appears
in these more complex problems.
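
The figures of the two-share example can be checked by a short computation. The sketch below is only an illustration under our own assumptions: "doubling" is read as a profit of +1 per unit invested and "losing its value" as a profit of -1, and the function names are ours. Writing s = a1 + a2 and d = a1 - a2, the expected logarithm splits into two independent one-variable problems, which gives the closed form used in the code.

```python
from fractions import Fraction
from math import log

def log_optimal_split(p11, p10, p01, p00):
    """Fractions (a1, a2) of the capital put on share 1 and share 2.

    Assumption: each share either doubles (profit +1) or becomes
    worthless (profit -1); p11, p10, p01, p00 are the joint probabilities.
    """
    s = (p11 - p00) / (p11 + p00)   # total fraction invested, a1 + a2
    d = (p10 - p01) / (p10 + p01)   # difference a1 - a2
    return (s + d) / 2, (s - d) / 2

def growth_rate(a1, a2, p11, p10, p01, p00):
    """Expected logarithmic growth per year, E log(1 + a1*X1 + a2*X2)."""
    return (p11 * log(1 + a1 + a2) + p00 * log(1 - a1 - a2)
            + p10 * log(1 + a1 - a2) + p01 * log(1 - a1 + a2))

probs = (Fraction(10, 100), Fraction(50, 100), Fraction(35, 100), Fraction(5, 100))
a1, a2 = log_optimal_split(*probs)
print(a1, a2, a1 + a2, a1 / a2)   # 13/51 4/51 1/3 13/4
```

Calling `growth_rate(a1, a2, *probs)` gives a positive growth rate, whereas putting everything on the first share makes the expected logarithm minus infinity, because that share loses its whole value with positive probability.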


d) Remarks 


(i) The martingale as a system of play was well known long before the
appearance of the mathematical theory of martingales. We shall
quote from the paper of J. L. Snell, "Gambling, probability and
martingales", The Mathematical Intelligencer, 4, 118–124 (1982): "The basic
idea of the martingale system is to double when you lose. For example,
suppose that we are playing roulette and we bet each time on red. We
make an initial bet of $1. If we win we quit; if we lose, we make a bet
of $2 next time. If we win, we are $1 ahead and we quit; if we lose, we
are down $1 + $2 = $3, and we bet $4. If we win, we are $1 ahead and quit;
if we lose, we bet $8 next time, etc. Under this martingale system, if the
wheel ever stops on a red number, we leave the casino $1 richer than
when we entered. Since a red is bound to show up eventually, it seems
that this is a foolproof system. But suppose that we enter the casino
with $100 and we encounter a run of 6 black. Then we have lost 2^6 - 1 = 63
dollars and we cannot make the next required bet of $64.

In his book Newcomes (The Newcomes; Memoirs of a Most Respectable
Family, Chapter 28, Page 266, London, Bradbury and Evans,
1953), Thackeray remarks 'You have not played as yet? Do not do so;
above all avoid a martingale if you do.' While this is good advice for
the gambler, mathematicians have not heeded it, and many important
results in probability theory have come from ignoring Thackeray's
advice."

(ii) Thomas Gresham (1519–1579), the founder of the London Stock
Exchange, must have guessed that mathematics had an important part
in the analysis of the Stock Exchange and of economic life. Gresham's will
included the plan of a college where mathematics was one of the main
subjects of economics teaching. Henry Briggs, who first published a
logarithmic table in 1617, was also a professor at Gresham College,
which can be considered, in many respects, the predecessor of the Royal
Society.

(iii) Martingales have many interesting applications in genetics,
potential theory, stochastic integrals, etc. The monographs of J. Neveu,
P. A. Meyer, C. C. Heyde, and P. Hall are outstanding in this field.
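
The arithmetic of the doubling system quoted in Remark (i) can be traced step by step. The following is a minimal sketch with names of our own choosing (`doubling_run`, `bankroll`, `first_bet`); it only follows an unbroken run of losses and says nothing about the odds of the wheel.

```python
def doubling_run(bankroll, first_bet=1):
    """Follow the doubling system through an unbroken run of losses.

    Returns (losses, total_lost, next_bet): how many losing bets the
    bankroll covers, the money lost so far, and the bet that can no
    longer be placed.
    """
    bet, lost, losses = first_bet, 0, 0
    while bankroll - lost >= bet:    # the next bet is still affordable
        lost += bet
        losses += 1
        bet *= 2
    return losses, lost, bet

print(doubling_run(100))   # (6, 63, 64)
```

With $100 the player survives six consecutive losses, is then down 63 dollars, and cannot cover the required bet of 64 dollars, as in the quotation.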

7. QUICKIES

a) Jacob and Laban's paradox 


According to the biblical story of Jacob and Laban, Jacob got Laban's 
dappled sheep in return for his services. Though the proportion of dap- 
pled sheep to others was very small, Jacob gradually acquired greater 
wealth than Laban. There are many mystic explanations of this paradox 
(the Bible itself contains one, and Thomas Mann also dealt with this 
riddle), but—as Alfred Rényi once pointed out—there is nothing mys- 
terious about this paradox at all; it can be understood by simple mathe- 
matical inference based on the fact that Jacob never returned sheep to 
Laban but Laban always gave Jacob some of his own sheep. 

Let us denote the average number of Jacob's and Laban's sheep in
the $n$th year by $J_n$ and $L_n$, respectively (in the initial, 0th year $J_0 = 0$
and $L_0$ is a positive number). Let us suppose that each sheep has $U$ lambs
every year on the average. Let $q$ denote the proportion of the lambs of
Laban's sheep that he gives to Jacob (the proportion $p = 1 - q$ remains with
Laban). Then $L_{n+1} - L_n = UpL_n$ and $J_{n+1} - J_n = UJ_n + UqL_n$; consequently
$L_n = L_0(1 + Up)^n$ and $J_n = L_0(1 + U)^n - L_0(1 + Up)^n$, therefore

$$\frac{J_n}{L_n} = \left(\frac{1 + U}{1 + Up}\right)^n - 1,$$

and this tends to infinity as $n$ increases, so Jacob will really be richer
than Laban after a time. For example, for $q = 10\%$, $U = 2$, $n = 20$, the
ratio $J_n/L_n$ is approximately 3.
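
The growth of the two flocks can be followed numerically. The sketch below simply iterates the two recurrences year by year and compares the result with the closed form; the function name `flocks` and the normalisation of Laban's initial flock to 1 are our own choices.

```python
def flocks(n, U, q, L0=1.0):
    """Iterate the two recurrences for n years; returns (J_n, L_n)."""
    p = 1.0 - q
    J, L = 0.0, L0
    for _ in range(n):
        # both right-hand sides use the flock sizes of the current year
        J, L = J + U * J + U * q * L, L + U * p * L
    return J, L

J20, L20 = flocks(n=20, U=2, q=0.10)
closed_form = ((1 + 2) / (1 + 2 * 0.9)) ** 20 - 1
print(J20 / L20, closed_form)   # both about 2.97
```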

b) A paradox of processes with independent increments


Processes with independent increments and their discrete versions, the
partial sums of independent random variables, are classical areas of
probability theory. Let $X_1, X_2, \dots$ be independent (not identically zero)
random variables with expectation zero. Then the sums
$S_n = X_1 + X_2 + \dots + X_n$, $n = 1, 2, \dots$, fluctuate about zero, i.e.,
(according to a theorem of K. L. Chung and W. H. J. Fuchs proved in 1951)
if the $X_i$'s have a common distribution, then

$$P(\limsup S_n = +\infty) = P(\liminf S_n = -\infty) = 1.$$

This fluctuating property, however, does not necessarily hold if the $X_i$'s
are not identically distributed. Put, e.g., $X_i = Y_i/\sqrt{1 - i^{-2}}$ for $i \ge 2$, where
$P(Y_i = i^{-1}) = 1 - i^{-2}$ and $P(Y_i = -i + i^{-1}) = i^{-2}$. If the $Y_i$'s are
independent, then the $X_i$'s are also independent and have expectation zero
and variance one. According to the Borel–Cantelli lemma, if
$A_1, A_2, A_3, \dots$ are arbitrary events and the sum of their probabilities
converges, then, with probability one, only a finite number of the events
$A_i$ occur. Hence the event $Y_i = -i + i^{-1}$ also occurs only finitely many
times (since $\sum_{i} i^{-2} < \infty$), so, with probability one, $Y_i = i^{-1}$ for all
sufficiently large $i$, that is, $X_i = 1/\sqrt{i^2 - 1}$. Since the sum of these
terms diverges,

$$P(\lim_{n \to \infty} S_n = \infty) = 1.$$
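
The claims about the moments of the counterexample can be verified exactly. The sketch below assumes the index starts at i = 2 (for i = 1 the normalising factor would vanish) and uses rational arithmetic; the helper name `moments_of_Y` is ours.

```python
from fractions import Fraction

def moments_of_Y(i):
    """Exact mean and variance of Y_i for an integer i >= 2."""
    values = [Fraction(1, i), -i + Fraction(1, i)]
    probs = [1 - Fraction(1, i * i), Fraction(1, i * i)]
    mean = sum(v * p for v, p in zip(values, probs))
    var = sum(v * v * p for v, p in zip(values, probs)) - mean ** 2
    return mean, var

for i in range(2, 200):
    mean, var = moments_of_Y(i)
    assert mean == 0
    assert var == 1 - Fraction(1, i * i)   # so Y_i / sqrt(var) has variance 1

# The probabilities of the "large negative jump" events have a bounded sum,
# which is the hypothesis of the Borel-Cantelli lemma.
print(float(sum(Fraction(1, i * i) for i in range(2, 2000))))
```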


c) The paradox of goals 


Two teams A and B are playing football against each other. Suppose the
teams have equal abilities (i.e., each team scores the next goal with
probability 1/2 at any time during the match). If the length of the time
interval between two consecutive goals is constant, then it seems natural to
think that for 50% of the playing time team A leads and for 50% of the
time team B leads. Surprisingly, however, just the contrary is true: it is
most improbable that A (or B) will be in the lead for half of the
playing time (if the cumulative scores are equal, the leading team is considered
the one which was leading before the last goal). If $n = 20$ goals are
scored during the match, then the probability that team A leads after 10 of
the goals and team B after the other 10 is only 6%; however, the
probability that one of the teams will be in the lead throughout the game
is approximately 35%. It is also surprising that the probability that one
team leads throughout the second half is 50 per cent, no matter how
large $n$ is.
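
The two percentages quoted for 20 goals follow from the discrete arc sine distribution given in Feller's book: with u(2k) the probability that a fair walk is back at zero after 2k steps, the lead time equals 2k out of 2n goals with probability u(2k) u(2n - 2k). The sketch below evaluates this formula and, as a sanity check on the tie-breaking convention, compares it with a direct enumeration of all goal sequences for a short match. The function names are ours.

```python
from itertools import product
from math import comb

def u(two_k):
    """Probability that a fair +-1 walk is back at 0 after two_k steps."""
    return comb(two_k, two_k // 2) / 2 ** two_k

def lead_time_formula(two_k, two_n):
    """P(team A is counted as leading during exactly two_k of two_n goals)."""
    return u(two_k) * u(two_n - two_k)

def lead_time_bruteforce(two_n):
    """Same distribution by listing all 2**two_n goal sequences."""
    counts = [0] * (two_n + 1)
    for goals in product((1, -1), repeat=two_n):
        diff, lead = 0, 0
        for g in goals:
            prev, diff = diff, diff + g
            # on a tie, the leader is whoever led before the last goal
            if diff > 0 or (diff == 0 and prev > 0):
                lead += 1
        counts[lead] += 1
    return [c / 2 ** two_n for c in counts]

print(round(lead_time_formula(10, 20), 4))       # 0.0606: A leads for 10 goals, B for 10
print(round(2 * lead_time_formula(20, 20), 4))   # 0.3524: one team leads throughout
```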

The situation changes considerably if the "goal-scoring ability" of
the teams depends on the score of the game. Let

$$p_k = \frac{1}{2}\left(1 + \frac{c}{k}\right)$$

be the probability that A scores the next goal if team A leads by $k$ goals,
$k \ne 0$; $p_0 = 1/2$. If $c$ is large and $k$ is small, $0 \le p_k \le 1$ may not hold;
then let $p_k = 1/2$. (If $c = 0$, then $p_k = 1/2$ for every $k$ and this leads to
the simple model we have just examined.) If $c$ is positive, the leading
team has more chance of scoring the next goal. If $c > 1/2$, then after a
time one of the teams "breaks down", that is, if many goals are scored,
only one team leads (which one depends on chance) during nearly 100%
of the playing time. On the other hand, if $c$ is negative, then the losing
team scores a goal with greater probability; for $c < -1/2$, the match
is very varied and interesting: for half of the play one team leads and for
the other half the other team.

It can be shown that for $c = 0$, the probability that team A will lead
for at most the fraction $x$ ($0 < x < 1$) of the playing time converges to

$$F(x) = \frac{2}{\pi} \arcsin \sqrt{x} \quad \text{as } n \to \infty.$$

The corresponding density function for $0 < x < 1$ is

$$f(x) = \frac{1}{\pi \sqrt{x(1 - x)}},$$

which is minimal for $x = 1/2$. Thus the probability density of A leading
for exactly 50% of the playing time is really the smallest. This is Paul
Lévy's arc sine law (1939). (My conjecture is that in the general case the
density function is proportional to the $(2c + 1)$th power of $f(x)$ if $c < 1/2$.)

Finally one more surprising fact: in the case $c = 0$, if the game ends in
a tie ($n : n$) and we want to know how long team A was in the lead, and
we take the interval between two consecutive goals as the unit of time,
then the probability that A was in the lead for a time $2k$
($k = 0, 1, 2, \dots, n$) is independent of $k$!
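
The uniformity of the lead time in a tied game can be checked by brute force for small $n$. The sketch below lists every order in which the $2n$ goals can fall when the game ends $n : n$ and counts the lead times, using the same tie-breaking convention as in the text; the function name is ours.

```python
from itertools import combinations

def lead_time_counts_given_tie(n):
    """For all goal orders ending n:n, count how often A's lead time is 2k."""
    counts = [0] * (n + 1)
    for a_goals in combinations(range(2 * n), n):   # positions of A's goals
        a_goals = set(a_goals)
        diff, lead = 0, 0
        for t in range(2 * n):
            prev = diff
            diff += 1 if t in a_goals else -1
            # on a tie, the leader is whoever led before the last goal
            if diff > 0 or (diff == 0 and prev > 0):
                lead += 1
        counts[lead // 2] += 1
    return counts

print(lead_time_counts_given_tie(4))   # [14, 14, 14, 14, 14]
```

Each of the five possible lead times occurs in 14 of the 70 goal orders, so each has conditional probability 1/5.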


(Ref.: Feller, W., An Introduction to Probability Theory and its Applications, John 
Wiley, New York, 1969. 
Lamperti, J., “Criteria for recurrence or transience of stochastic process I,” 
J. Math. Anal. Appl., 1, 314—330, (1960).) 


d) The paradox of expected ruin time 


A and B are playing a coin tossing game. If it is heads, then A pays B
1 dollar; if tails, B pays A 1 dollar. A's initial capital is 1 dollar and
B's is 999 dollars; they play till one of them is ruined. A has of course
more chance of running out of money first. If the coin falls heads at the
first toss, A is already ruined. Surprisingly, however, the expected
duration of the game is quite long: on average one of them is ruined only
after 999 trials. (Is this duration really considerably longer than we would
expect? In general it can be proved that if A has $a$ dollars, and his
adversary B has $b$, then the average duration of the game is $ab$ trials; in
particular, if $a = b$, then the expected duration of the game is $a^2$.)
F. Stern examined the case where the coin is not necessarily fair, and
called attention in 1975 to a surprising phenomenon (Math. Mag. 48,
286–288). Suppose A wins with probability $p$ in each turn and B with
probability $1 - p$ ($0 < p < 1$), and they both have $a$ dollars at the
beginning of the game. It seems evident that if $p \ne 1/2$, the conditional
expectation of the duration of the game, given that A is ruined at the
end, is completely different from the conditional expectation given that
B is ruined at the end of the game. Yet it can be shown that whichever of
A's or B's ruin is assumed, the average durations of the games, and also
their distributions, are equal.
The proof is simple: the probability of B's ruin after the $(2k + a)$th trial
is given by $p_{2k+a} = c_{k,a}\, p^{k+a}(1 - p)^{k}$ ($k = 0, 1, 2, \dots$), and similarly the
probability of A's ruin after the $(2k + a)$th trial is $q_{2k+a} = c_{k,a}\, p^{k}(1 - p)^{k+a}$,
where $c_{k,a}$ is the total number of games which consist of exactly $k$ heads
and $k + a$ tails. As the ratio $p_{2k+a}/q_{2k+a}$ is independent of $k$, the
conditional distributions $p_{2k+a}/\sum_k p_{2k+a}$ and $q_{2k+a}/\sum_k q_{2k+a}$ are identical, as we
have stated. The explanation of this phenomenon lies in the following
fact: if $p = 0.99$, then a long game results in B's ruin with great
probability, so the expected duration of the game, given that A will be ruined
at the end, is very short, just as in the case where B's ultimate ruin is
given. (For another argument see E. Seneta, "Another look at
independence of hitting place and time for simple random walk", Stoch. Proc.
and their Appl., 10, 101–104 (1980).) We have already mentioned that
if both A and B have $a$ dollars and they play with a fair coin, then the
expected duration of the game is $a^2$ turns; but what happens if they play
with two different coins: the probability that A will win with the first
coin is $p_1 = 1/2 + \varepsilon$, and with the second coin it is $p_2 = 1/2 - \varepsilon$
($0 < \varepsilon < 1/2$)? The players choose $p_1$ or $p_2$ in each turn, depending randomly
on A's accumulated gain $k$ ($k = 1, 2, \dots, 2a - 1$), in the following way:
before starting the game, we draw $p_1$ or $p_2$ for each value of $k$,
independently of each other and with equal probability. One may feel intuitively
that this game with its complicated formulation is quasi-identical (at
least for large $a$) with the game where $p_1 = p_2 = 1/2$ for every $k$ (i.e.,
the classical coin tossing game), since the large number of terms $\pm\varepsilon$
cancel each other for large $a$. But this is not so. J. G. Sinai has recently
pointed out that the average duration of this complicated game is far
longer. Even the logarithm of the average number of necessary tosses is
of order $\sqrt{a}$ (in contrast to the above-mentioned $a^2$). This surprising fact
can be explained on the basis of Remark (i) in I/9. In a sequence of length
$a$, which consists of independent and equally probable $p_1$'s and $p_2$'s,
there is a $p_1$ or $p_2$ run of length $\log_2 a$ with large probability, and this
drifts the gain towards the initial capital, so it delays the ultimate ruin.
It is very difficult to get over this "thick wall", and that is why the average
duration of the game increases. Problems of this type (i.e., random walks
in random environments) are in close connection with the theory of
random fields in the last quickie.
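
Both statements about the duration of the game with a fixed coin can be checked by exact computation. The sketch below is an illustration with names of our own: the first function propagates the distribution of A's capital step by step and records when each player is ruined, so that Stern's observation can be tested time by time; the second solves the recurrence D(x) = 1 + (D(x-1) + D(x+1))/2 with D(0) = D(a+b) = 0 for a fair coin by trying two starting slopes and interpolating linearly.

```python
from fractions import Fraction

def ruin_time_distributions(a, p, max_steps):
    """P(A ruined exactly at step t) and P(B ruined exactly at step t).

    Both players start with `a` dollars; A wins a dollar with probability p.
    """
    total = 2 * a
    state = {a: Fraction(1)}          # distribution of A's capital
    a_ruined, b_ruined = {}, {}
    for t in range(1, max_steps + 1):
        nxt = {}
        for capital, prob in state.items():
            for step, w in ((1, p), (-1, 1 - p)):
                c = capital + step
                if c == 0:
                    a_ruined[t] = a_ruined.get(t, 0) + prob * w
                elif c == total:
                    b_ruined[t] = b_ruined.get(t, 0) + prob * w
                else:
                    nxt[c] = nxt.get(c, 0) + prob * w
        state = nxt
    return a_ruined, b_ruined

def fair_expected_duration(a, b):
    """Expected number of tosses with a fair coin, A has a and B has b."""
    n = a + b

    def shoot(slope):
        # D(0) = 0, D(1) = slope, then D(x+1) = 2 D(x) - D(x-1) - 2
        d = [0, slope]
        for x in range(1, n):
            d.append(2 * d[x] - d[x - 1] - 2)
        return d

    d0, d1 = shoot(0), shoot(1)
    slope = Fraction(-d0[n], d1[n] - d0[n])   # makes D(n) = 0
    return d0[a] + slope * (d1[a] - d0[a])

p = Fraction(2, 3)
a_ruined, b_ruined = ruin_time_distributions(a=3, p=p, max_steps=41)
ratio = (p / (1 - p)) ** 3
print(all(b_ruined[t] == ratio * a_ruined[t] for t in a_ruined))   # True
print(fair_expected_duration(1, 999))                              # 999
```

Since the two ruin-time probabilities differ only by a factor that does not depend on the time, the two conditional distributions of the duration coincide.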

e) A paradox of optimal stoppings


We are playing heads or tails with a fair coin so that we stop playing after
the $n$th toss. In this case we win either

$$\frac{n}{n+1}\, 2^{n}$$

dollars or nothing, depending on whether the outcome is always tails
or not. When is it advisable to stop? Let $T_n$ denote our prize (depending
on chance) after the $n$th game:

$$T_n = \frac{n}{n+1}\, 2^{n} \quad \text{or} \quad T_n = 0.$$

Supposing that $T_n \ne 0$, the expected value of the prize $T_{n+1}$ is

$$E(T_{n+1} \mid T_n \ne 0) = \frac{n+1}{n+2}\, 2^{n},$$

which is greater than $\frac{n}{n+1}\, 2^{n}$, meaning that it is always worth going on
playing. The probability, however, of $T_n = 0$ for some (possibly large) $n$
is 1. Is it really worth playing till we lose everything?
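
The tension in this game can be made concrete with exact arithmetic. The sketch below assumes the prize after $n$ tails is $\frac{n}{n+1} 2^n$ and uses function names of our own. It confirms that one more toss always has the larger conditional expectation, while the rule "stop after toss $n$ whatever happens" is worth only $n/(n+1)$, which approaches 1 but never reaches it, and never stopping is worth nothing.

```python
from fractions import Fraction

def prize(n):
    """Prize after n tosses if every toss so far was tails."""
    return Fraction(n, n + 1) * 2 ** n

def expected_if_stopping_at(n):
    """Expected prize of the rule 'stop after toss n': P(n tails) * prize."""
    return Fraction(1, 2 ** n) * prize(n)

for n in range(1, 60):
    # one more toss: lose everything with probability 1/2, else prize(n + 1)
    assert Fraction(1, 2) * prize(n + 1) > prize(n)

print([str(expected_if_stopping_at(n)) for n in (1, 2, 10, 100)])
# ['1/2', '2/3', '10/11', '100/101']
```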


(Ref.: Chow, Y.S., Robbins, H. and Siegmund, D., Great Expectations: The Theory of
Optimal Stopping, Houghton Mifflin, Boston, 1971. 
Shiryayev, A. N., Optimal Stopping Rules, Springer, New York, 1978.) 


f) The paradox of choices 


One often has to choose the best one (from a certain point of view)
out of a collection of persons or objects (e.g., when shopping or getting
married). When studying this problem, we assume that the persons or
objects can be arranged in order of goodness, i.e., comparing any two
of them, we can always decide which is the better one. Selecting the best
would cause no problem if we saw all of them together. In most cases,
however, objects or persons are examined successively and, once
rejected, one cannot return to them. In the following we will assume that
if a "candidate" is not selected when it is his turn then we will not have
the opportunity to change our minds later. The problem is not unique
even so. We might not even know the total number of opportunities we
must choose from. (Generally, we do not have this information when
choosing our future wife or husband.) Let us suppose that there are
altogether $n$ possibilities, more precisely let $n$ persons or objects pass us
in some order (all orders are considered equally probable). Now the
question is the following: what method should be chosen to select the
best candidate if each of them can only be compared, naturally, with the
previous ones? If we always choose, e.g., the third one, the chance of
selecting the best is $1/n$. With $n$ growing, $1/n$ converges to 0, and
therefore if the number of offers is great, the probability of selecting the best
one is nearly 0. Surprisingly, however, there is a method which enables
us to select the best candidate with a probability of more than 30% even if
$n$ is a large number. The method is the following. Let the first 37% (more
precisely, $100/e\,\%$) of the candidates go and then select the first one better
than any previous candidate (if none is better, select the last). In this case
the chance of selecting the best is approximately $1/e$, i.e., about 37%, however
great $n$ is.
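
For a fixed number of candidates the success probability of this rule can be computed exactly. In the sketch below (names are ours), the first $r$ candidates are let go; the best candidate, standing at position $j > r$, is then chosen exactly when the best of the first $j - 1$ candidates is among the first $r$, which has probability $r/(j-1)$.

```python
from fractions import Fraction

def success_probability(n, r):
    """P(best candidate is chosen) when the first r of n are let go.

    After the first r, the first candidate better than all earlier ones
    is accepted. Requires 0 <= r < n.
    """
    if r == 0:
        return Fraction(1, n)        # the very first candidate is taken
    # the best is at position j and the best of the first j-1 is among the first r
    return sum(Fraction(r, j - 1) for j in range(r + 1, n + 1)) / n

def best_cutoff(n):
    """Number of candidates to let go that maximizes the success probability."""
    return max(range(n), key=lambda r: success_probability(n, r))

r = best_cutoff(100)
print(r, float(success_probability(100, r)))   # 37 and about 0.371
```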

If two, three, ..., or generally $k$ choices are allowed and the point is
only to have the best one among the $k$ candidates selected, then the
optimal probability $p_k$ of this event can be calculated as follows. Let the
numbers $c_j$ satisfy the identity


j=l 
then

$$p_k = \sum_{j=1}^{k} e^{-c_j},$$

e.g.,

$$p_2 = \frac{1}{e} + \frac{1}{e^{3/2}},$$

which is more than 1/2(!). It can also be shown that


1\* 
x АЕ Ре: core 
( 5) рс 


thus $p_k$ converges to 1 as $k$ tends to infinity.

If the number of candidates $N$ is a random variable, then the chance
of selecting the best candidate may decrease. Suppose that the
distribution of $N_m/m$ converges to the distribution of a random variable $X$. Then
the optimal probability of selecting the best candidate (more precisely
its limit as $m \to \infty$) is

$$p_X = \max_x E(f(x/X)),$$

where $f(x) = \max(0, -x \ln x)$. The probability $p_X$ may be very small
since $\inf p_X = 0$.


(Ref.: Chow, Y., Robbins, H., Siegmund, D., Great Expectations: The Theory of 
Optimal Stopping, Houghton Mifflin Co., Boston, 1971. Freeman, P. R., “The 
secretary problem and its extensions: A review", Internat. Statist. Review, 51 
189—206, (1983), Berezovskii, B. A., Gnedin, A. V., Optimal Choice Problems 
(in Russian), Nauka, Moscow, 1984.) 


g) The Pinsker paradox of stationary processes


A sequence of random variables $X_n$ ($n = \dots, -3, -2, -1, 0, 1, 2, 3, \dots$) is
called stationary (more precisely, stationary in a wide sense) if firstly,
the expected value of $X_n$ does not depend on $n$ (therefore we can assume
without loss of generality that this common expectation is 0) and
secondly, the covariances $E(X_n X_m) = r_{n-m}$ (the existence of which is assumed)
depend only on the difference $n - m$ (in particular, for $n = m$, the variances
do not depend on $n$). A vector valued $X_n$ ($n = \dots, -2, -1, 0, 1, 2, \dots$) is
stationary if $E(X_n)$ is identically equal to the zero vector and the expected
value of the product of the $i$th coordinate of $X_n$ and the $j$th coordinate of
$X_m$ depends only on $i$, $j$ and $n - m$. Two basic types of stationary processes
are the singular and regular processes. The former is deterministic (i.e.,
for any value of $n$, $X_{n+1}$ does not contain any "information" uncorrelated
with the random variables preceding $X_{n+1}$), while the regular type does not
have a deterministic part (i.e., if we omit $X_n$, $X_{n-1}$, $X_{n-2}$, etc., then we
gradually lose all information). In this way the world of singular processes
is ready-made, and it does not gain information as time passes, while regular
processes create a new world out of nothing, that is, the far future is
almost independent of the present. (In the Hilbert space of square
integrable random variables, i.e., those of finite variance, the above
statement can be formulated as follows. If $H_n$ denotes the subspace which is
generated by the random variables preceding $X_n$, then in the singular case
$H_n = H_{n-1}$ for all $n$, while in the regular case $\bigcap_n H_n = 0$.) The importance
of singular and regular processes was shown by Wold's theorem, which
states that any stationary process can be uniquely decomposed into the
sum of a regular and a singular process. It is rather obvious that if $X_n$
is singular then $X_{-n}$ is singular, too, and if $X_n$ is regular then $X_{-n}$ is regular
as well. In other words, if $n$ denotes the time parameter then both
singularity and regularity remain unchanged when reflecting past and future.
Surprisingly, however, this is only true when $X_n$ is a scalar. Pinsker
constructed a two-dimensional stationary process which is regular but whose
inverse (when $-n$ takes the part of $n$) is singular. Thus singularity
may turn into regularity and vice versa if past and future are reversed.


h) The paradox of voting and electing; Random fields 


When voting or electing, the outcome is generally uncertain and
therefore it is not surprising that important probabilistic results have been
discovered in this area too. In 1878, W. A. Whitworth proved the
following famous ballot theorem. If there are two candidates, say, A and B,
A scores $n$ votes, B scores $m$ votes, and $n > m$ (i.e., A has won) and $p$
denotes the probability that throughout the counting there are always
more votes for A than for B (provided that each order of counting is
equally probable), then

$$p = \frac{n - m}{n + m}.$$

Thus, if $n = 2m$, then $p = 1/3$, that is, if A has received twice as many
votes as B then the probability that B had an equal number of votes
sometime during the counting is twice as much as the probability that
A was ahead throughout the counting. (See Feller, W., Probability
Theory and Its Applications, (2nd ed.), Wiley, New York, 1965, p. 66.)
This may sound strange but it is not a paradox. Paradoxes do, however,
occur in this field too. Marquis de Condorcet (one of Voltaire's friends)
pointed out the following example in 1785 (Essai sur l'application de
l'analyse à la probabilité des décisions rendues à la pluralité des voix).
If there are three candidates in an election, A, B, and C, and they receive
23, 19, and 18 votes, respectively, then on the basis of plurality alone A
would be the winner but in actual fact, all the 19 who voted for B may well
prefer C to A (so that C would beat A by 37 votes to 23 in a two-way
contest). In 1950 Kenneth Arrow (the 1972 Nobel Prize winner in
economics) used the above example to show that it is logically impossible
to create an absolutely fair election system. Thus it is not surprising that
there does not exist a standard election system accepted all over the world.
(On the probabilistic controversy about the election system in the USA
see Grofman's paper.) The following paradox concerns a special kind
of voting: trials. Let us suppose that A, B, C, D and E are the five
members of a jury. They decide whether a prisoner is guilty or not by majority.
A and B each give the wrong verdict with a chance of 5%, for C and D
it is 10%, and E is mistaken with a probability of 20%. (Mistakes are
committed independently.) In this case the probability of bringing the
wrong verdict is about 0.7%. Paradoxically, this probability increases
to about 1.2% if E (who is most probably mistaken) abandons his own
judgement and always votes the same way as A (who is most rarely
mistaken). The following paradox also shows how surprising situations may
arise if voters abandon their own judgements. Let us suppose that each
vertex of a planar square lattice is occupied by a person who votes for
or against, independently of the others, with probability $p$ and $1 - p$,
respectively. Meanwhile each of them chooses one of his four neighbours at
random and votes the next time as that person did previously. The third,
fourth, etc. vote is carried out similarly. (When voting for the $n$th time,
everybody gives the $(n - 1)$th vote of the chosen neighbour.) The question is
the following: what happens if $n \to \infty$? It can be shown that everybody will
give the same vote in the end, in other words, "perfect harmony" will be
reached. (The probability that everybody votes for or against is $p$ and
$1 - p$, respectively.) It is worth mentioning that if voters are placed at
the vertices of the three-dimensional cubic lattice (where everybody has
six neighbours) then such an extreme situation will not occur, i.e.,
different opinions may coexist (more precisely there
is an ergodic limit distribution). The same holds for more than three
dimensions. This fundamental difference between two and three
dimensions is in close connection with the fact (see III/5a) that in the case of
the two-dimensional square lattice, symmetric random walks reach any
vertex with probability 1, while in 3 dimensions this is not true. (See
Bramson, M., Griffeath, D., "Renormalizing the 3-dimensional voter
model", Annals of Prob., 418–432, (1972).)
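
Both the ballot theorem and the jury figures lend themselves to a direct check by enumeration. The sketch below uses names of our own choosing; the first function lists every counting order for small $n$ and $m$, and the second sums the probabilities of all 32 combinations of right and wrong individual opinions, optionally letting E repeat A's vote. With the error rates given above, the enumeration yields 0.7055% for independent votes and exactly 1.2% when E copies A.

```python
from fractions import Fraction
from itertools import combinations, product

def ballot_probability(n, m):
    """P(A is strictly ahead throughout the count), by listing all orders."""
    good = total = 0
    for a_positions in combinations(range(n + m), n):
        a_positions = set(a_positions)
        lead, ahead = 0, True
        for t in range(n + m):
            lead += 1 if t in a_positions else -1
            if lead <= 0:
                ahead = False
                break
        good += ahead
        total += 1
    return Fraction(good, total)

def wrong_verdict(error_rates, e_copies_a=False):
    """P(majority of the jurors A, B, C, D, E is wrong)."""
    prob = Fraction(0)
    for wrong in product((False, True), repeat=5):
        weight = Fraction(1)
        for is_wrong, rate in zip(wrong, error_rates):
            weight *= rate if is_wrong else 1 - rate
        votes = list(wrong)
        if e_copies_a:
            votes[4] = votes[0]      # E repeats A's vote; E's own opinion is ignored
        if sum(votes) >= 3:
            prob += weight
    return prob

rates = [Fraction(5, 100)] * 2 + [Fraction(10, 100)] * 2 + [Fraction(20, 100)]
print(ballot_probability(6, 3))                      # 1/3
print(float(wrong_verdict(rates)))                   # 0.007055
print(float(wrong_verdict(rates, e_copies_a=True)))  # 0.012
```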

The above mathematical model of voters standing at the vertices of a
square or a cubic lattice has gained a very important role in the
mathematical physics of the last few years. Voters are replaced by "units"
with two possible values (e.g., the spin of ferromagnetic materials).
These random fields are generalizations of stochastic processes in which
the time parameter $t$ is replaced by an element of a multidimensional space,
e.g., if $t$ stands for the vertices of a $d$-dimensional cubic lattice and
$X(t)$ is a random variable for any $t$ (in the voting model $X(t)$ takes only
two values) then $X(t)$ is a random field. Just as we supposed that the
voters' opinions are only influenced by those of their neighbours, in
physics we may also assume (as a first approximation) that each particle is
influenced only by its neighbours. This kind of random field is called a
Markov field (it is the analogue of a Markov chain). In studying
ferromagnetism, a special Markov field, namely the Ising model, became very
important mainly due to the studies of the Norwegian physicochemist
L. Onsager in 1944. In the last few years Markov fields and especially
the Ising model have been applied to help solve the problem of phase
transitions. Though the exact notion of a Markov field was only introduced
in 1968 by the Soviet mathematician R. L. Dobrushin, the first description
of the notion of phase and that of certain random fields had already been
given much earlier with potential functions in J. W. Gibbs' book
of 1902 (Elementary Principles of Statistical Mechanics, Yale Univ.
Press). The description of Markov fields by potential functions is
especially important because phase transitions occur just when the potential
does not determine the Markov field uniquely. In physical terms this
means that there may be more than one phase present at the same
temperature. The theory also explains why phase transitions are impossible
above the critical temperature (Onsager even succeeded in determining
the critical temperature). It is interesting that while a phase transition
cannot occur in the one-dimensional model, in the case of the
two-dimensional one (on the square lattice) it is already possible. In the latter
case, in spite of the symmetry of the potential function (the value of the
function does not change if all "yes" states are replaced by "no" and
vice versa), the Markov field itself is not symmetrical. It is due to this
paradox (called symmetry breaking) that ferromagnetic materials do not
lose their magnetism below the critical temperature.


(Ref.: Grofman, B., "Fair apportionment and the Banzhaf index", The Amer. Math.
Monthly, 88, 1—5, (1981). 
Kindermann, R., Snell, J. L., Markov Random Fields, Contemporary Math. 
Vol. 1, AMS, Providence RI, 1980. 
Preston, C. J., Gibbs States on Countable Sets, Cambridge Univ. Press, 1974. 
Sinai, J. G., Rigorous Results in the Theory of Phase Transitions, Akadémiai 
Kiadó, Budapest, 1982.) 


Chapter 4 


Paradoxes in the foundations of probability theory. 


Miscellaneous paradoxes 


“De natura Rationis non est res, ut 
contingentes; sed, ut necessarias, con- 
templari." 


(B. Spinoza, Ethica, Pars Secunda, 
Propositio XLIV) 


"Probability is the most important 
concept in modern science, especially as 
nobody has the slightest notion what 
it means." 
(Bertrand Russell,
In a lecture, 1929)


"Calcul des Probabilités. Première Leçon.
1. L'on ne peut guère donner une
définition satisfaisante de la Probabilité..."
(H. Poincaré, Calcul des
Probabilités, 1896, p. 1.)


"My thesis, paradoxically, and a little
provocatively, but nonetheless genuinely,
is simply this: PROBABILITY
DOES NOT EXIST."
(B. de Finetti,
Theory of Probability, 1974)

In 1900, at the International Mathematical Congress in Paris, David
Hilbert considered the problem of the foundation of probability theory
as one of the 23 most important unsolved problems in mathematics.
Though by the turn of the century probability theory had produced many
outstanding results, due to the lack of foundation, this theory as a whole
could not join other branches of mathematics. This may be the main
reason why F. Klein, a professor at the University of Göttingen, did not even
mention probability theory in his work "Mathematics of the 19th
century". Utilizing the results of a number of mathematicians, especially
those of E. Borel, A. Lomnitzky, H. Steinhaus, and using set and measure
theory, A. N. Kolmogorov developed the exact theory of probability in
1933. (Details can be found in Archive for Hist. of Exact Sci., 18,
123–190, 1978.) The base of Kolmogorov's theory is that every event (whose
probability we want to obtain; these events are called observable events)
can be represented by a subset of the set of all elementary events (i.e.,
by a subset of the phase space). For instance, when tossing a die, the
outcomes can be 1, 2, ..., 6; these elementary events together form the
phase space, and the event that the outcome is even can be represented
by the subset of the phase space consisting of the even numbers, {2, 4, 6}.
The certain event is represented by the entire phase space, which is
traditionally denoted by $\Omega$. Kolmogorov's theory assumes that the
observable events form a sigma-algebra (sigma refers to infinity), i.e., the joint
occurrence of any two observable events, the occurrence of at least one
of finitely or countably infinitely many observable events, and the
complement of any observable event are also observable events. A nonnegative
number is assigned to each observable event, this is the probability of the
event, such that the probability of the certain event (i.e., of the entire phase
space) is 1, and the sigma-additive property holds, i.e., in the case of
pairwise exclusive events, the probability of the occurrence of at least
one (and so, owing to the pairwise exclusion, exactly one) observable
event in a finite or countably infinite collection of observable events is
the same as the sum of the probabilities of the observable events in the
collection.

The question arises: when defining probability, why do we need sigma-
algebras instead of the set of all the subsets of the phase space Ω? The
answer is very simple: in general, the probability cannot be defined on the
set of all the subsets of Ω. More precisely, if probability is defined on a
sigma-algebra consisting of some subsets of Ω, then this probability may
not be extendable to the rest of the subsets of Ω if sigma-additivity is
still required (unless Ω has finitely or countably infinitely many
elements). G. Vitali knew this result as early as 1905. Let the phase space
be the interval (0, 1), and make an attempt to define the probability on
all the subsets of (0, 1) according to the "uniform distribution".
Obviously, the probability b − a should be assigned to a subinterval
(a, b). Thus, due to sigma-additivity, the probability is automatically
defined on the least sigma-algebra containing the intervals. This
probability can be extended to some other sets, but there also exist sets
to which it cannot be extended, i.e., on which the probability cannot be
defined according to the "uniform distribution".

Such a "pathological" set was constructed by E. Zermelo as follows.
He sorted the points of the interval (0, 1) into disjoint classes such that
the points whose distance was rational belonged to the same class. Then,
using the axiom of choice, he defined a set H that had exactly one point
from each of the above classes. It can be proved that this set H cannot
have any probability according to the "uniform distribution".

It can also be shown that if we abandon "uniformness" but require that
each subset of Ω have a probability and that the probability of each point
in Ω be 0, then even this kind of probability definition is impossible in
the case of a phase space Ω whose cardinality is countable or (assuming the
continuum hypothesis) continuum (see G. Birkhoff, Lattice Theory, Amer.
Math. Soc., Providence, 1967, p. 266). It is not yet known whether or not a
space Ω (with sufficiently large cardinality) exists on which a probability
fulfilling the above requirements can be defined. This is the problem of
measurable cardinals. The situation changes crucially if we abandon the
axiom of choice; see T. Jech, Set Theory, Acad. Press, New York, 1978, and
R. M. Solovay, "A model of set theory in which every set of reals is
Lebesgue measurable", Annals of Math., 92, 1–56, (1970).

Though in Kolmogorov's theory the probability is always a nonnegative
number, several theorems in probability theory can be extended so that a
negative number can be a probability, too. For example, K. J. Hochberg
(Proc. Amer. Math. Soc., 79, 298–302, 1980) proved that in the theorems
obtained by such an extension of the central limit theorems, there occurs
the real (both positive and negative) valued "density function" u_n(t, x),
which can be derived from the fundamental solutions to the following
extension of the differential equation of heat conduction:

$$\frac{\partial u}{\partial t} = (-1)^{n+1} \frac{\partial^{2n} u}{\partial x^{2n}}, \qquad n = 2, 3, \ldots$$

(the original differential equation of heat conduction is the case where
n = 1). This book, however, deals neither with negative or complex valued
probability measures nor with other extensions of probability.


1. PARADOXES OF RANDOM NATURAL NUMBERS


a) The history of the paradox 


In Kolmogorov's theory of probability it is impossible to choose a nat- 
ural (positive integer) number at random with uniform distribution, for 
if the probability of selecting, e.g., 1 is 0, then due to uniformity, the 
probability of choosing any other natural number is also 0. Thus sigma- 
additivity leads to a contradiction because the probability of choosing 
a natural number is 1 and not 0. On the other hand, if the probability of 
choosing 1 is positive then sigma-additivity leads again to a contradic- 
tion (the probability of the certain event would be infinite). In spite of 
this fact it is natural to expect that the probability of choosing an odd or 
an even number is 1/2. The next definition (which disregards sigma-
additivity) gives just this probability. Let K be an arbitrary subset of
the natural numbers and let k_n denote the number of elements in K not
greater than n. The relative frequency k_n/n is the probability of choosing
a number from K provided we choose with uniform distribution from the first
n numbers. If the limit of the relative frequency k_n/n exists as n tends
to infinity, then this limit is called the probability of K. By this
definition the probability of choosing an integer divisible by 2, 3, etc.
is 1/2, 1/3, ..., respectively. The probability that two random integers
(chosen independently with uniform distribution) are relatively prime can
also easily be calculated. Supposing first that neither of the integers is
greater than n, the corresponding probability (depending on n) is
calculated, and then its limit is considered as n → ∞. Chebyshev already
showed in the last century that this limit is 6/π² (≈ 0.61). Accordingly,
if both the numerator and the denominator of a fraction are random natural
numbers, then with probability 6/π² the fraction cannot be reduced. The
following paradoxes also concern random natural numbers. According to J. E.
Littlewood, the first is due to the famous physicist E. Schrödinger. In a
1935 article F. P. Cantelli attributed the second paradox to P. Lévy.
(P. Lévy was one of the most outstanding geniuses of probability theory.
He came to occupy Poincaré's and Hadamard's seat at the French
Académie des Sciences.)
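
The limit definition above can be tried out numerically. The sketch below computes the relative frequency k_n/n for a finite n and the fraction of relatively prime pairs among the first n numbers; the function names and the chosen values of n are assumptions made for illustration only.

```python
from math import gcd, pi


def density_estimate(predicate, n):
    """Relative frequency k_n / n of the numbers 1..n satisfying predicate."""
    return sum(1 for k in range(1, n + 1) if predicate(k)) / n


def coprime_fraction(n):
    """Fraction of pairs (a, b) with 1 <= a, b <= n and gcd(a, b) = 1."""
    hits = sum(1 for a in range(1, n + 1)
               for b in range(1, n + 1) if gcd(a, b) == 1)
    return hits / (n * n)


print(density_estimate(lambda k: k % 2 == 0, 3000))  # close to 1/2
print(density_estimate(lambda k: k % 3 == 0, 3000))  # close to 1/3
print(coprime_fraction(300), 6 / pi ** 2)            # both close to 0.61
```

For finite n these are ordinary probabilities; only the passage to the limit gives up sigma-additivity.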
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b) The paradoxes 


(i) Two consecutive random natural numbers are drawn, one on each of the
foreheads of two players A and B. The person with the smaller number
loses and is obliged to pay the other as many dollars as shown on his
own forehead. Both players have the right to veto, i.e., finding the
number on the other's forehead too large, either of them can ask for a new
game. (The number drawn on his own forehead is naturally unknown
to the player.) However, following the reasoning below, neither of them
will veto. Both of them may think: "I can see the number k on my opponent's
forehead. Therefore I have either k − 1 or k + 1. Each case is equally
probable, but if I lose, I pay only k − 1 dollars, while if I win, I get k
dollars, so it is not worth vetoing." As the expected value of the prize is
positive, the game seems to be favourable to both players, which is of
course impossible.

(ii) Let us choose two random natural numbers X and Y independently,
with uniform distribution. For any fixed (non-random) number x the
probability of Y ≤ x is 0. Similarly, for any fixed y the probability of
X ≤ y is 0. Consequently, the probabilities of both Y ≤ X and X ≤ Y
are also 0, which is impossible, for one of them is certainly true.


c) The explanation of the paradoxes 


(i) The paradox is brought about by the fact that there is no uniform
distribution on the set of natural numbers. If the numbers written on the
players' foreheads had at most three digits, then there would exist a
uniform distribution on these numbers, but then the above reasoning that
led to the paradox would become completely false.
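
The finite version of the game can be computed exactly. The sketch below assumes that the pair (m, m + 1) is chosen uniformly with m = 1, ..., n_max − 1 and that a fair coin decides which player gets which number; this model and the function names are assumptions for illustration. It shows where the "each case is equally probable" argument breaks down: at the largest visible number the player loses for certain.

```python
from fractions import Fraction


def conditional_gain(k, n_max):
    """Expected gain of a player who sees k on the opponent's forehead."""
    outcomes = []
    if k + 1 <= n_max:   # own number is k + 1: win, the opponent pays k
        outcomes.append(Fraction(k))
    if k - 1 >= 1:       # own number is k - 1: lose and pay k - 1
        outcomes.append(Fraction(-(k - 1)))
    return sum(outcomes) / len(outcomes)


def seeing_probability(k, n_max):
    """Probability that the opponent's forehead shows k."""
    cases = (1 if k + 1 <= n_max else 0) + (1 if k - 1 >= 1 else 0)
    return Fraction(cases, 2 * (n_max - 1))


def overall_gain(n_max):
    """Unconditional expected gain of either player."""
    return sum(seeing_probability(k, n_max) * conditional_gain(k, n_max)
               for k in range(1, n_max + 1))


print(conditional_gain(500, 999))   # an interior value of k
print(conditional_gain(999, 999))   # the largest possible k
print(overall_gain(999))            # the game as a whole
```

For every interior k the conditional gain is positive, yet the single large loss at k = n_max balances it, so the game is fair overall.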

(ii) No doubt, the probability of Y ≤ x is 0 for any fixed x (by the
definition mentioned in the history of the paradoxes), but from this fact
it does not follow that the probability of Y ≤ X is also 0. It would follow
only if the probability were sigma-additive, but this kind of probability
(as we have mentioned) is not sigma-additive.
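
A short worked computation makes the gap visible. It assumes that X and Y are first taken independent and uniform on {1, ..., n} and that the limit n → ∞ is taken afterwards, in the spirit of the limit definition given in the history of the paradoxes; the notation P_n is introduced here only for this purpose.

```latex
\[
P_n(Y \le x) = \frac{\min(x, n)}{n} \xrightarrow[n \to \infty]{} 0
\quad \text{for every fixed } x,
\qquad
P_n(Y \le X) = \frac{1}{n^2} \sum_{j=1}^{n} j = \frac{n+1}{2n}
\xrightarrow[n \to \infty]{} \frac{1}{2}.
\]
```

Each event {Y ≤ x} has limiting probability 0, while the event {Y ≤ X}, which is a countable union of the events {Y ≤ x, X = x}, has limiting probability 1/2.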
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d) Remarks 


(i) Number theory and probability theory are in close connection. To 
illustrate how probabilistic ideas can be applied in number theory, we 
recall first that the relative frequency of primes among integers less 
than n is about 1/log n (if n is large enough). Supposing that primes are 
distributed randomly and independently among the first n numbers, 
the probability of choosing two primes close to n is about 1/(log n)²
(due to the independence). Let us consider an interval around n the
length of which is c. (c is small compared to n, but large enough for
statistical considerations.) According to the above result, the number of
twin primes (primes the difference of which is 2) belonging to this interval
is about c/(log n)². A more detailed analysis (which takes into account,
e.g., that an integer differing by 2 from a prime (≠ 2) is certainly odd and
therefore more likely a prime itself) shows that the expected number of twin
primes exceeds c/(log n)² by about 32%. Calculating on this basis, M. F.
Jones, M. Lal and W. J. Blundon published a table in the Mathematics 
of Computation in 1967 which shows, e.g., that among the first 150 
thousand numbers greater than 100 million the expected number of twin 
primes is 584. Actually this number is 601. The difference is fairly small. 
Similarly, considering the first 150 thousand numbers after 100 trillion, 
we expect 191 twin primes, while the actual number is 186. This kind of 
“statistical” approach to primes produces fairly good results. It is of 
special interest because the number of twin primes (infinite or not) is 
still unknown. (The largest prime known up to the present is 28 — 1). 
Primes follow each other according to a very complicated, seemingly
random rule. This is why a probabilistic approach is excellent in this
case. We will return to the relation of complexity and randomness with
"the paradox of the Monte Carlo method".
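
The heuristic count can be compared with an actual sieve. In the sketch below the factor 1.32 stands for the "about 32%" correction quoted above; the function names and the convention that a pair (p, p + 2) is counted when both members lie in the interval are assumptions, so the exact count may differ slightly from the published table.

```python
from math import isqrt, log

TWIN_FACTOR = 1.32  # the "about 32%" correction quoted in the text


def expected_twins(n, c):
    """Heuristic number of twin prime pairs in an interval of length c near n."""
    return TWIN_FACTOR * c / log(n) ** 2


def primes_up_to(limit):
    """Sieve of Eratosthenes on 0..limit."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, isqrt(limit) + 1):
        if sieve[p]:
            sieve[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i, flag in enumerate(sieve) if flag]


def count_twins(lo, hi):
    """Count pairs (p, p + 2) of primes with lo <= p and p + 2 <= hi."""
    size = hi - lo + 1
    is_prime = bytearray([1]) * size
    for p in primes_up_to(isqrt(hi)):
        start = max(p * p, ((lo + p - 1) // p) * p)
        is_prime[start - lo::p] = bytearray(len(range(start, hi + 1, p)))
    for small in (0, 1):
        if lo <= small <= hi:
            is_prime[small - lo] = 0
    return sum(1 for i in range(size - 2) if is_prime[i] and is_prime[i + 2])


n, c = 10 ** 8, 150_000
print(round(expected_twins(n, c)), count_twins(n, n + c))
```

With n = 100 million and c = 150 thousand the heuristic reproduces the expected value 584 quoted in the text; the sieve gives the actual count for comparison.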

(ii) According to the finitely additive uniform distribution on the
natural numbers, the probability of any finite subset A (of natural
numbers) is 0. Supposing now that the distribution is not uniform but has
the property that, given an arbitrary positive number ε, there is a finite
set A whose probability is P(A) ≥ 1 − ε, then the difference between
additivity and sigma-additivity disappears. More precisely, if a
probability P with this property is additive on the finite subsets of
natural numbers (or on any countable set Ω), then the probability can be
extended to every subset in such a way that the extension becomes
sigma-additive on the sigma-algebra of all subsets of natural numbers.

If Ω is not a countable set (e.g., the whole interval (0, 1)) then quite
strange additive probabilities may occur. They may take only the values
0 and 1, and they are defined on every subset of (0, 1) (supposing the
axiom of choice in set theory). These probabilities are strange because
countably many events with probability 1 may be very unlikely to occur
simultaneously, i.e., the probability of their joint occurrence might be 0.
Similarly, at least one of countably many events with probability 0 may
occur with probability 1.


e) References 


Elliott, P., Probabilistic Number Theory, Springer, New York, 1980.
Carus Math. Monographs, 12, Wiley, New York, (1959).
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The following paper is a paradoxical approach of an outstanding unsolved problem. 

It claims that the Riemann hypothesis (which does not depend on chance!) is true with 

probability one: 

Good, І. J., Churchhouse, R. F., “The Riemann hypothesis and pseudorandom features 
of the Möbius sequence", Mathematics of Computation, 22, 857—864, (1968). 

A random analogue of Dirichlet's celebrated theorem on the infinitude of primes in 
arithmetic progressions is discussed in: 

Ruzsa, I. Z., Székely, G. J., “Intersections of traces of random walks with fixed sets", 
Annals of Probability, 10, 132—136, (1982). 


2. BANACH—TARSKI PARADOX 
a) The history of the paradox 


The uniform probability, or the corresponding length, area and volume
in one, two, and three dimensions, cannot be defined on arbitrary sets
if the sigma-additivity of these measures is required. However, if we
assume only additivity (that is, the measure of the union of two disjoint
sets equals the sum of their individual measures), then, as the Polish
mathematician S. Banach showed, in one and two dimensions every bounded
set becomes measurable (has length, or area). Accordingly, uniform
probability can also be defined on every (bounded) set in one and
two dimensions if we assume only the additivity of probabilities.
Hausdorff, however, showed in 1914 that such an extension of measures in
three dimensions is impossible. S. Banach and A. Tarski set forth a
paradoxical theorem in 1924 which picturesquely showed that neither an
additive measure (volume) nor the corresponding uniform probability can be
defined on arbitrary bounded sets in three dimensions.


b) The paradox 


Considering a ball of radius r = 1 cm, it is possible to divide it into a
finite number of pieces and then reassemble them to form a ball of radius
R = 1 km. In general, if A and B are bounded subsets of R³ having nonempty
interiors, then there exist a natural number n and partitions
{A_j : 1 ≤ j ≤ n} and {B_j : 1 ≤ j ≤ n} of A and B, respectively, into n
pieces each, such that A_j is congruent to B_j for all j. (A subset X of R³
is bounded if it is contained in some ball, and X has a nonempty interior
if it contains some ball. By a partition of a set X we mean a pairwise
disjoint family of subsets of X whose union is X.)


c) The explanation of the paradox 


If we chop a ball of radius r=1 cm into a finite number of pieces, we 
might intuitively expect that putting the pieces together, they can only 
form solid figures whose volume is equal to that of the original ball of 
radius 1 cm. This is, however, true only if we chop the ball into pieces 
which have volume. The point of the paradox is that in the three-dimen- 
sional space there are non-measurable sets, to which we cannot assign 
volume, if we want to keep the additive property of the volume, and if 
we want the volumes of two congruent sets to be equal. (The proof
of the Banach–Tarski theorem depends on Zermelo's Axiom of Choice.)


d) Remarks


Several outstanding mathematicians (for example, the Italian de Finetti) 
consider the sigma-additivity of probability a too strong restriction, but 
accept additivity. The Banach—Tarski paradox shows that changing 
sigma-additivity for additivity does not solve every problem and also 
brings new ones. In automata theory the dilemma of assuming or not 
assuming sigma-additivity became so critical, that even the Encyclopae- 
dia Britannica deals with the problem. Electronic computers are often 
used to generate (theoretically) infinite sequences of random numbers 
(cf. the next paradox). The probability of each sequence is zero, but the 
probability of their union is one. Thus the acceptance of sigma-additivity 
rests upon the tacit assumption that we cannot generate random phenom- 
ena by automata, i.e., random and non-random sequences are separated, 
which is just the well-known bifurcation of the Greek goddesses Tyche 
(chance) and Moira (fate). 


e) References 


Banach, S., Tarski, A. “Sur la décomposition des ensembles de points en parties re- 
spectivement congruentes", Fund. Math., 6, 244—277, (1924). 

Stromberg, K., “The Banach—Tarski paradox", The American Math. Monthly, 86, 
151—160, (1979). 


3. THE PARADOX OF THE MONTE CARLO METHOD 
a) The history of the paradox 


The Monte Carlo Method is a numerical method based on random 
sampling. In solving numerical problems there can frequently be found 
a probabilistic model where the unknown number appears. Then it is 
possible to solve the problem in such a way that we observe the outcomes 
of random experiments belonging to the probabilistic model so many 
times that we can estimate (from these outcomes) the unknown number 
with a prescribed accuracy. Though the idea of this method is quite old, 
its actual application dates back only to the invention of computers when
J. von Neumann, S. Ulam and E. Fermi used it for the approximate solution
of difficult numerical problems of nuclear reactions. The name of the
method refers to the series of random numbers used here which, in
principle, could also be the regularly announced results of games
played in gambling houses, e.g., in Monte Carlo. In practice, however,
the computer itself produces the random numbers necessary for
the method. Consequently this nice name (first used by N. Metropolis
and S. Ulam in 1949) is totally misleading (the method is not of
much help in trying to win in Monte Carlo). The idea of the Monte Carlo
method first appeared in a 1777 work by Buffon (see I. 11). It gives a
method for the estimation of π by throwing a needle randomly. If
parallels are drawn on a table at unit distance and a needle of length
L < 1 is thrown randomly on the table (the angle between the parallels and
the needle, and the distance of the centre of the needle from any given
parallel are independent and uniformly distributed over (0, 2π) and
(−1/2, 1/2), respectively) then the probability that the needle will
intersect one of the parallels is 2L/π. If the experiment is carried out
many times then the relative frequency of intersections will be very near
to the theoretical probability 2L/π, and thus π can be calculated. This
method of the approximation of π is only of theoretical importance since to
get two-figure accuracy, several thousands of throws have to be made. (By
another method π can be determined to one million figures, see G. Miel's
article.) Buffon's needle problem shows that the Monte Carlo method is
not suitable for very accurate calculations. Even to obtain results of 
two or three-figure accuracy, thousands or millions of experiments have 
to be made. It is obvious therefore that the Monte Carlo method only 
became applicable when experiments could be simulated by computers. 
Instead of needle-throwing, two independent random numbers were 
generated which determined the position of the (supposed) needle and 
whether it intersected the (supposed) parallels. As the computer is able 
to generate several millions of numbers a minute, it does not take too 
long to simulate millions of experiments that would otherwise take a 
life-time. 
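
A minimal simulation of the needle experiment might look as follows. The needle length 0.8, the seed and the function name are arbitrary assumptions; note also that the sketch uses the value of π to draw the angle, so it illustrates the method rather than an independent computation of π.

```python
import math
import random


def buffon_pi(throws, needle_length=0.8, seed=0):
    """Estimate pi from the relative frequency of needle-line intersections."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(throws):
        angle = rng.uniform(0.0, 2.0 * math.pi)  # angle between needle and parallels
        centre = rng.uniform(-0.5, 0.5)          # signed distance to the nearest parallel
        half_span = 0.5 * needle_length * abs(math.sin(angle))
        if abs(centre) <= half_span:             # the needle reaches the parallel
            hits += 1
    # P(intersection) = 2L / pi, hence pi is roughly 2L * throws / hits
    return 2.0 * needle_length * throws / hits


for throws in (1_000, 100_000):
    print(throws, buffon_pi(throws))
```

The error shrinks only like the inverse square root of the number of throws, which is why the method is unsuitable for very accurate calculations.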

The theory of generating random numbers by computers has become
an important branch of mathematics. Instead of actual random numbers
(which might be produced by any random physical process such as
radioactive decay), pseudo-random numbers, generated by deterministic
computer algorithms, came into the limelight.

In connection with pseudo-random numbers the following question
arises. In what sense can they be considered random, since they are
generated by deterministic (non-random) algorithms? Since von Mises'
article in 1919 several outstanding mathematicians have dealt with
this problem. (Its philosophical aspects were studied by P. Kirschenmann 
and P. McShane, among others.) 


b) The paradox 


In 1965–66 Kolmogorov and Martin-Löf put the notion of randomness
in a new light. They defined when a series consisting of 0's and 1's can
be considered random. The main idea is the following. The more difficult
it is to describe a series (i.e., the longer its "shortest" generating
program is) the more random we may consider it. Naturally, the length of
this "shortest" program may vary if we use different computers. For this
reason a standard machine is chosen, which is called a Turing machine.
The measure of complexity of a series is the length of the shortest Turing
program which can generate the series. Complexity is a measure of
irregularity. A series whose length is N is called random if its complexity
is nearly maximal. (It can be shown that most series are of that kind.)
As Martin-Löf proved, these series can be considered random because
they satisfy all the statistical tests of randomness. Complexity and
randomness are therefore in close connection. If a programmer wants to
generate "real" random numbers, then, due to Kolmogorov's and
Martin-Löf's results, he can only generate the series by a rather long
program. At the same time, in practice, random number generators are very
short. How can these two things be reconciled?
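
To see how short a practical generator can be, consider the following sketch of a multiplicative congruential (Lehmer) generator. The constants 16807 and 2³¹ − 1 are a well-known choice taken here as an assumption for illustration; they are not mentioned in the text.

```python
from itertools import islice

A, M = 16807, 2 ** 31 - 1  # multiplier and prime modulus (illustrative choice)


def lehmer_states(seed=1):
    """Deterministic sequence state -> (A * state) mod M."""
    state = seed
    while True:
        state = (A * state) % M
        yield state


def uniform_numbers(count, seed=1):
    """Pseudo-random numbers in (0, 1) obtained by scaling the states."""
    return [s / M for s in islice(lehmer_states(seed), count)]


sample = uniform_numbers(10_000)
print(sample[:3])
print(sum(sample) / len(sample))  # near 1/2 for a uniformly distributed series
```

The whole "program" is one multiplication and one remainder, so the complexity of the series it generates is small, however irregular the series looks.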


c) The explanation of the paradox 


Series generated by short programs and used as random numbers actually
satisfy only a few criteria of randomness, not all. This, however, causes
hardly any problems in application. For example, for the purpose of
numerical integration, it is enough to know that the pseudo-random
numbers are uniformly distributed over an interval.

Suppose we want to integrate a function f of bounded variation on the
interval (0, 1). Then the number

$$I = \int_0^1 f(x)\,dx$$

is approximated by the mean

$$\frac{1}{N} \sum_{i=1}^{N} f(x_i)$$

not only if the series x_1, x_2, ..., x_N is random and uniformly
distributed over the interval (0, 1). It is enough to require that the
series is uniformly distributed. This means that as N → ∞,

$$D_N = \sup_{0 \le x \le 1} |c(x, N) - x|$$

converges to 0, where c(x, N) is the quotient of the number of x_1, x_2,
..., x_N belonging to (0, x) and N, i.e., the relative frequency.

It can be shown that

$$\left| I - \frac{1}{N} \sum_{i=1}^{N} f(x_i) \right| \le V_f D_N,$$

where V_f is a constant depending on the function f (the total variation
of f). From this it follows that the smaller D_N is, the more accurate the
approximation of I. D_N, however, is not minimal in the case of random
series. For random series the order of approximation is N^{-1/2}, while in
non-random cases an accuracy of N^{-1} log N can be obtained.

In many cases it turns out that instead of trying to cope with the
"impalpable concept of randomness" one should use deterministic sequences
that are very well suited to the given problems. This is the essence of the
quasi-Monte Carlo method.
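
The difference between a pseudo-random and a deliberately regular series can be sketched numerically. The example below uses the van der Corput sequence in base 2 as the deterministic series and f(x) = x² as the integrand; both choices, like the function names, are assumptions made for illustration. The quantity D_N is computed by the standard formula for sorted points in one dimension.

```python
import random


def van_der_corput(n, base=2):
    """n-th element of the van der Corput sequence (digit reversal of n)."""
    q, weight = 0.0, 1.0 / base
    while n > 0:
        n, digit = divmod(n, base)
        q += digit * weight
        weight /= base
    return q


def star_discrepancy(points):
    """D_N = sup over x of |c(x, N) - x| for points in [0, 1)."""
    xs = sorted(points)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def mean_value(f, points):
    return sum(f(x) for x in points) / len(points)


def f(x):
    return x * x  # total variation on [0, 1] is V_f = 1


EXACT = 1.0 / 3.0
N = 1024
quasi = [van_der_corput(i) for i in range(N)]
rng = random.Random(0)
pseudo = [rng.random() for _ in range(N)]

for name, pts in (("quasi", quasi), ("pseudo", pseudo)):
    print(name, star_discrepancy(pts), abs(mean_value(f, pts) - EXACT))
```

In both cases the integration error stays below V_f D_N, and the regular series has the smaller D_N.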


d) Remarks 


(i) Recently connections between randomness and complexity have led
to several interesting discoveries. In mathematics it has long been a
general practice to handle overly complicated structures as if they were random
(e.g., the behaviour of the complicated sequence of primes is frequently 
described by probabilistic laws). The concept that randomness cannot 
be distinguished from complexity is, however, such a revolutionary idea 
that it is significant even from a philosophical point of view. Using this 
concept, Spinoza's motto can be restated as follows: People prefer simple 
things to complicated ones—which is undoubtedly true. At the same time 
it is obvious that the more we try to understand nature the more we have 
to realize that not everything is simple. 

(ii) The application of random number series is rather wide-ranging.
Numerical integration, numerical solution of differential equations,
computer simulation of physical, chemical, biological, technical, and
economic problems, etc. also require random sequences. They help to
solve traffic, transport, and other optimization problems, as well as
to create astronomical models. The efficiency of different computer
programs can also be tested if the data are random numbers.

Finally we should mention a completely different field of application,
computer art, where random number sequences offer millions of
variations (random number sequences can of course correspond to
series of sounds, colours, letters, etc.). From random sequences the
computer filters out those that do not meet the rules recognized when
studying sample models. If the computer works on the basis of enough
samples, the artistic result will be fairly good. Xenakis, the Greek
composer, has used computer-made random sounds in his works. Several
exhibitions of computer graphics have already been organized. (It has to be
noted that not all computer graphics apply random sequences.) An
international organization of computer artists was founded in 1970.


4. THE PARADOX OF UNINTERESTING NUMBERS;
AN INCALCULABLE PROBABILITY


a) The history of the paradox 


Whether a number is interesting or uninteresting is completely subjective, 
but one can give an objective definition on the basis of the previous 
paradox. We shall consider a number interesting if its complexity (defined
in the previous section) is small. Therefore rational numbers are
interesting, for their decimals recur periodically; π and e are also
interesting among irrational numbers, since their digits can be generated
by a quite simple computer program. There exist, however, irrational
numbers which are more irregular. Normal numbers, for example, have the
following property: every digit (and what is more, every group of a fixed
number of digits) occurs with the same relative frequency in the infinite
sequence of their decimal expansion. Most of the irrational numbers are
normal, but it is difficult to decide whether a particular number is normal
or not. Thus, for example, it is not known whether π (whose first one
million decimals were published in 1974) or e is normal. At the same time
there exists a very simple (but artificially constructed) example of a
normal number. In the early thirties D. G. Champernowne showed that the
following number is normal:


0.123 456 789 101 112 131 415 161 718 192 021 222 324 252 6...


(the decimals are the consecutive natural numbers). The situation was
similar in arithmetic more than a hundred years ago, when Liouville
constructed a transcendental number for the first time in 1844, that is, a
number which cannot be the solution of an algebraic equation with integral
coefficients. π and e were only proved to be transcendental numbers as
late as 1882 and 1873, by F. Lindemann and Ch. Hermite, respectively. The
study of numbers regarding normality began only after the turn of the
century, due especially to the researches of E. Borel. Since that time the
investigation of regularity and irregularity in the sequences of digits has
evolved into an interesting theory, especially after the researches of
A. N. Kolmogorov, P. Martin-Löf, R. J. Solomonoff and G. J. Chaitin. The
following paradox is one among the many paradoxes in this field.
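
Champernowne's number shows how a very short program can produce a normal number. The sketch below generates its leading decimals and tabulates the digit frequencies of a finite prefix; the function names and the prefix length are assumptions for illustration, and a finite prefix can only hint at the limiting frequency 1/10.

```python
from collections import Counter
from itertools import count, islice


def champernowne_digits(n_digits):
    """First n_digits decimals: the natural numbers written one after another."""
    def stream():
        for k in count(1):
            yield from str(k)
    return "".join(islice(stream(), n_digits))


def digit_frequencies(digits):
    """Relative frequency of each decimal digit in the given string."""
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in "0123456789"}


print(champernowne_digits(43))
for digit, freq in digit_frequencies(champernowne_digits(100_000)).items():
    print(digit, round(freq, 4))
```

The frequencies approach 1/10 only slowly, since in every block of numbers with equally many digits the leading digit is never 0.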


b) The paradox 


In most numbers digits follow each other randomly, that is, most of 
the numbers are uninteresting in the following sense: the computer 
programs which produce these numbers are not much shorter than the 
numbers themselves. In spite of this most numbers cannot be proved 
to be uninteresting (in any system of axioms free from contradiction). 
There exist an infinite number of uninteresting numbers, but this can be 
proved only for a finite number of them. 


c) The explanation of the paradox 


Initially it may seem surprising that something that cannot be identified 
may exist, but similar phenomena appear not only in the world of mathe- 
matics. For example, if all the one hundred thousand seats in a stadium 
are occupied, but only ninety-nine thousand tickets have been sold, then 
it is clear that one thousand people have sneaked in without a ticket. The 
identification of these people, however, is hopeless (especially if the tickets 
were taken away from everybody at the entrance). Thus we are sure that 
there are one thousand people in the stadium who got in without a ticket, 
but we cannot prove that any particular person got in without a ticket. 
Phenomena such as this occur frequently in mathematics. 

It is not at all surprising that most of the numbers are uninteresting,
if we consider how difficult it is to "discover" any regularity even
in a seven-digit telephone number, to make it easier to memorize. In the
case of one hundred or one thousand digit numbers this would be even more
difficult for a larger percentage of these numbers. So it is the second
part of the paradox which is more surprising, especially for those who are
inexperienced in the paradoxes of 20th century logic. Among these paradoxes
G. G. Berry's is the nearest to ours. (This paradox was published for the
first time seventy years ago in "Principia Mathematica" by B. Russell and
A. N. Whitehead.) The "computerized version" of Berry's paradox comes
from E. F. Beckenbach. It claims that uninteresting natural numbers may
not exist, because then the smallest of these would be interesting. In other
words: the smallest of the numbers which can be produced only by long
computer programs can also be produced by a short program, and this
is undoubtedly a contradiction. It must be assumed that some numbers
can be uninteresting even if we cannot prove it. One can show that if
the system of axioms and inference rules we use contains n bits of
information, then the "uninteresting" property of a number cannot be proved
if its information content is much more than n bits.


d) Remark 


A very important criterion of the randomness of digits in a number is 
that they cannot be extrapolated, or predicted. The question is whether 
there exist any (not random) number which can be defined precisely but 
whose digits cannot be predicted. The question was answered in the 
affirmative in an example of Chaitin. Let a random heads-tails sequence, 
or a corresponding 0—1 sequence be the input of a certain computer, 
namely a Turing-machine. The probability that the Turing-machine will 
ever stop for a random input, defines the Chaitin number. (Theoretically 
the machine may work for an infinitely long time, because it does not 
receive order which would make it stop.) It can be proved that the Chaitin 
number is an “uninteresting” number whose decimals cannot be predicted. 
At the same time this “uninteresting” Chaitin number has very interesting 
properties. If, for example, we knew its first few thousand decimals, 
then we would also get the answers to some classical, unsolved problems 
of mathematics, such as the Fermat-conjecture or the Goldbach-conjecture. 
The Fermat conjecture, (which claims that the equation x"J4-y"—z" 
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cannot be solved for natural numbers x, y, z and n>2)*, could be proved 
or disproved theoretically by a computer program, the **Fermat-pro- 
gram" which would compute for given values of n and z if there exist 
any numbers x and y which are solutions to the Fermat equation. If we 
gradually increased the values of n and z, then every case would be 
checked. The computer would stop if it found a solution. If the computer 
ever stops, the conjecture is disproved, and if it never stops, the conjecture 
is true. The “only” problem is the following: no matter how long the 
computer has been already working for, we can never be sure that the 
machine will not stop in the next step. We could get round this problem 
if we knew the Chaitin-constant. Consider all the binary inputs of finite 
length and try to select the programs (inputs) which terminate the comput- 
er. First we try to see if the computer stops for the first program in the 
first step; then if it stops for the second program in the first step. Then 
we let the first program run till the second step, the third program till 
the first step, the second one till the second step and the first one till the 
third step, etc. If the computer stops for some binary input of length К, 
then in thought we put a 1/2* unit weight into a sack. Gradually there 
will be more and more weights in the sack and their sum will converge 
to the Chaitin-constant (since the probability that an arbitrary binary 
sequence of length k will occur іп a heads-tails sequence is 1/2%). Let т 
be the length of the binary **Fermat-program". We continue to run the 
programs until the difference between the Chaitin-constant and the accu- 
mulated weight in the sack is less than 1/2". If the Fermat-conjecture has 
not turned out to be false up till this time, it must be true, because if the 
*Fermat-program" terminated the computer later, then we would have 
to put a weight of 1/2" units into the sack, and this is in contradiction 
with the fact that we have approached the Chaitin-constant with an 
accuracy of more than 1/2". The Chaitin-constant contains the solutions 
(or the theoretical possibility of solutions) for all the problems which can 
be reduced to a stopping (halting) problem such as the one we have 
just discussed. 
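
The interleaved schedule described above (often called dovetailing) can be sketched in a few lines. This is only an illustration under stated assumptions: the "machine" is a hypothetical toy table `TOY_PROGRAMS` of program lengths and halting steps invented for the example, and the names `dovetail` and `sack_after` do not come from the text.

```python
from fractions import Fraction
from itertools import count, islice


def dovetail():
    """Yield (program, step) pairs so that every pair is reached eventually."""
    for total in count(2):                       # total = program + step
        for program in range(total - 1, 0, -1):
            yield program, total - program


# Hypothetical toy machine: program number -> (length in bits, halting step).
# A halting step of None means that the program never stops.
TOY_PROGRAMS = {1: (1, 3), 2: (2, None), 3: (3, 1), 4: (3, None)}


def sack_after(n_pairs):
    """Total weight in the sack after the first n_pairs scheduled runs."""
    sack = Fraction(0)
    for program, step in islice(dovetail(), n_pairs):
        length, halts_at = TOY_PROGRAMS.get(program, (0, None))
        if halts_at == step:                     # the program stops at this step
            sack += Fraction(1, 2 ** length)     # put a weight of 1/2^k into the sack
    return sack


print(list(islice(dovetail(), 6)))
print(sack_after(4), sack_after(6), sack_after(100))
```

The weights in the sack only ever increase, which is why the accumulated sum approaches its limit from below, as the argument in the text requires.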


* Recently a German mathematician, G. Faltings, proved a very deep theorem
implying that the number of possible (essentially different) solutions of
Fermat's equation is finite, a big step in Fermat’s direction.
5. THE PARADOX OF RANDOM GRAPHS
a) The history of the paradox


Structural problems in several fields of science (such as problems of
electrical networks) can easily be demonstrated and solved by graphs,
i.e., by points and lines connecting them. Points are called the vertices,
lines are called the edges of the graph. Edges may also represent
connections depending on chance. That is why research concerning the
structure of random graphs is of great importance. The theory of random
graphs is mainly due to the work of Paul Erdős and Alfréd Rényi.

Suppose that a graph has n vertices and each edge is drawn with
probability p, independently of the existence of other edges. Let ε>0 be an
arbitrary number. In 1960 Erdős and Rényi proved that if

$$p \le (1-\varepsilon)\,\frac{\ln n}{n},$$

then the probability that the graph is connected converges to 0 as n
increases; on the other hand, if

$$p \ge (1+\varepsilon)\,\frac{\ln n}{n},$$

then this probability converges to 1. (We say that a graph is connected
if any vertex can be reached from any other vertex through the edges.)
Thus the probability

$$p = \frac{\ln n}{n}$$

has a “dividing ridge” role. During the last two decades, the theory of
random graphs was extended to graphs with infinitely many vertices. In
connection with these infinite graphs, Erdős and Rényi drew attention to the
following paradox.
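
The “dividing ridge” for finite graphs can be observed in a small simulation. The sketch below is illustrative only: the values n = 200, ε = 0.5 and 20 trials are arbitrary choices, the natural logarithm is assumed in the threshold, and the helper names `is_connected` and `connected_fraction` are introduced here rather than taken from the text.

```python
import math
import random


def is_connected(n, p, rng):
    """Sample a random graph on n vertices and test connectivity (union-find)."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]        # path halving
            v = parent[v]
        return v

    components = n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:                 # edge {i, j} is drawn
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
                    components -= 1
    return components == 1


def connected_fraction(n, p, trials, seed=0):
    """Fraction of sampled graphs that turn out to be connected."""
    rng = random.Random(seed)
    return sum(is_connected(n, p, rng) for _ in range(trials)) / trials


n, eps = 200, 0.5
threshold = math.log(n) / n
below = connected_fraction(n, (1 - eps) * threshold, trials=20)
above = connected_fraction(n, (1 + eps) * threshold, trials=20)
print(below, above)
```

For such a moderate n the two fractions are only approximations of the limits 0 and 1, but the contrast between the two sides of the threshold is already clearly visible.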


b) The paradox 


We say that two graphs G_1 and G_2 are isomorphic if there exists a
one-to-one correspondence between the vertices of G_1 and those of G_2 such
that two vertices in G_1 are connected if and only if the corresponding
vertices in G_2 are also connected.

If two graphs are isomorphic then their vertex sets have the same
cardinality, but this is by no means a sufficient condition for isomorphism.
However, if the cardinality is infinite, more precisely, if the cardinality
of the vertex set is the same as that of the integers and any two vertices
are connected with probability 1/2 independently of the other edges, then
two such graphs are isomorphic with probability 1. Consequently, in this
sense all infinite random graphs are the same!


c) The explanation of the paradox 


We say that a graph is universal if for any finite sequences
u_1, u_2, ..., u_k and v_1, v_2, ..., v_m of vertices (different from each
other) there exists a vertex w different from the u's and v's such that w is
connected with every u but with none of the v's. It is easy to show that if
G_1 and G_2 are countably infinite and universal then they are also
isomorphic. The probability that the random graphs in the
paradox are not universal, however, is 0 (i.e., w exists with probability 1).
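
The claim that w exists with probability 1 can be made explicit by a short calculation, sketched here under the assumption of the paradox that every edge is present independently with probability 1/2. The symbols N and A_N are introduced only for this illustration. Fix vertices u_1, ..., u_k and v_1, ..., v_m, and let A_N be the event that none of N further vertices is connected with every u and with none of the v's. Each candidate vertex succeeds independently with probability 2^{-(k+m)}, hence

```latex
P(A_N) = \left(1 - 2^{-(k+m)}\right)^{N} \longrightarrow 0 \qquad (N \to \infty).
```

Since a countable vertex set admits only countably many finite choices of the u's and v's, the union of all these exceptional events still has probability 0.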


d) Remarks 


Besides the research on random graphs, the analysis of other random
structures (random matrices, random algebraic equations, random power
series, etc.) has also led to several interesting results in the last few years.
E.g., N. B. Maslova proved the following theorem: If the coefficients X_j
of the random algebraic equation

$$\sum_{j=1}^{n} X_j z^j = 0$$

are independent and identically distributed random variables with
expectation 0 (but are not identically 0) and

$$E\left(|X_j|^{2+\varepsilon}\right) < \infty$$

for some positive ε, then, asymptotically, the number of its real roots
is normally distributed with expected value

$$\frac{2}{\pi}\ln n$$

and standard deviation

$$2\sqrt{\pi^{-1}\left(1-2\pi^{-1}\right)\ln n}.$$
6. THE PARADOX OF EXPECTATION
a) The history of the paradox


A well-known theorem of probability theory states that if X and Y are
random variables with finite expectations, then the expectation of their
sum exists and equals the sum of their expectations:
E(X+Y) = E(X)+E(Y). It can easily be shown that even if E(X) and E(Y) do
not exist, but E(X+Y) exists, then E(X+Y) depends only on the
distributions of X and Y, that is, E(X+Y) can be determined without
knowing the joint distribution of X and Y. Surprisingly this is not true for
three variables.
b) The paradox


If X, Y and Z are arbitrary random variables for which E(X+Y+Z)
exists, then this expectation cannot always be determined knowing only
the individual distributions of X, Y and Z.


c) The explanation of the paradox 


Let us define the random variables X, Y and Z in two different ways.
Their distributions will be the same in both cases, but the expectation
E(X+Y+Z) will be different.

Let U be uniformly distributed on the interval (0, 1). Then clearly
1−U and V=|2U−1| are also uniformly distributed on (0, 1). If

$$X = Y = \tan\left(\frac{\pi}{2}U\right)$$

and Z = −2X, then X+Y+Z = 0 and thus E(X+Y+Z) = 0, whereas
if

$$X = \tan\left(\frac{\pi}{2}U\right), \qquad Y = \tan\left(\frac{\pi}{2}(1-U)\right)$$

and

$$Z = -2\tan\left(\frac{\pi}{2}V\right),$$

then the inequality X+Y+Z>0 holds with probability one, so
E(X+Y+Z) is also positive, more precisely,

$$E(X+Y+Z) = \frac{4}{\pi}\ln 2.$$
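
The second construction can be checked numerically. The sketch below assumes X = tan(πU/2), Y = tan(π(1−U)/2) and Z = −2 tan(πV/2) with V = |2U−1|, and integrates X+Y+Z over u in (0, 1) by the midpoint rule. The function names and the number of grid points are choices made for this illustration only.

```python
import math


def sum_second_construction(u):
    """Value of X + Y + Z in the second construction for U = u."""
    v = abs(2.0 * u - 1.0)
    x = math.tan(math.pi * u / 2.0)
    y = math.tan(math.pi * (1.0 - u) / 2.0)
    z = -2.0 * math.tan(math.pi * v / 2.0)
    return x + y + z


def expectation(n_points=20000):
    """Midpoint-rule approximation of E(X + Y + Z) for uniform U on (0, 1)."""
    h = 1.0 / n_points
    return h * sum(sum_second_construction((i + 0.5) * h) for i in range(n_points))


estimate = expectation()
exact = 4.0 / math.pi * math.log(2.0)
print(estimate, exact)
```

Although X, Y and Z are each unbounded, their sum in this construction stays between 0 and 2, which is why the simple quadrature behaves well.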


d) Remarks 


(i) Since E(X+Y+Z) = E((X+Y)+Z) and E(X+Y+Z+W) =
E((X+Y)+(Z+W)), the expectations of sums of three and four
variables are uniquely determined by the two-dimensional distributions.
It is not known, however, whether this is true for more than four random
variables or not.
(ii) Ruzsa and Székely showed that it is possible to assign a real
number E(X) to every random variable X, so that this number equals the
expectation whenever the expectation exists and is finite, and


E(X+Y) = E(X)+E(Y)


always holds if X and Y are independent. Our paradox shows that this
kind of extended expectation does not exist for not necessarily
independent random variables [for the random variables defined in Section c),
E(X+Y+Z) should be zero since E(X)=E(Y) and E(Z) = −2E(X)].
7. THE PARADOX OF THE FIRST DIGIT


a) The history of the paradox 


About a century ago, in 1881, Simon Newcomb drew attention to an in- 
teresting empirical fact in the Amer. J. Math. This discovery was soon 
forgotten, however, and was only rediscovered 60 years later by Frank 
Benford, a physicist at the General Electric Company. The law was 
named after him. (Newcomb is not the only person to have been unfairly 
treated. The sarcastic law of Eponymy states that no scientific theorem 
or discovery is named after its original discoverer.) W. Weaver tells Ben- 
ford's story in “Lady Luck”: “I have been told that an engineer at the 
General Electric Company, some twenty-five years or so ago, was walking 
back to his office with a book containing a large table of logarithms. He 
was holding it at his side, spine down; and as he glanced down at the 
edges of the pages, he noticed that the book was dirtiest at the opening 
pages and became progressively cleaner—just as though the early parts 
of the book had been consulted a lot, the middle less, and the concluding 
part least of all. ‘But that’, he must have thought, ‘is ridiculous. That
implies that people must frequently look up the logarithms of numbers
beginning with the digit 1, next most frequently numbers beginning
with 2, and so on, and least frequently numbers beginning with 9.
And this just cannot be so; because people look up the logarithms of
all sorts of numbers, so that the various digits ought to be equally well
represented.’”


b) The paradox 


Consider a table, e.g., the table of the integer powers of 2 or any table of
physical constants or a table of population statistics. It generally turns
out that the first digit (≠0) of the numbers in the tables is not uniformly
distributed on 1, 2, 3, ..., 9. 1 is the most frequent, then comes 2 and so
on, 9 being the rarest. According to Benford, the relative frequency of
the first digits not greater than k is not k/9 (which would mean uniform
distribution) but rather lg (k+1) (where lg stands for log_10).
Consequently, the relative frequencies of 1, 2, ..., 9 are about 30.1%,
17.6%, ..., 4.6%. (Benford's law can be put in another way: the mantissas
of the logarithms of the numbers are approximately uniformly distributed
over the interval (0, 1).) Benford's law does not claim that 1 is the most
frequent first digit in every table (anybody could create a table containing
not a single 1) but that typically the tables contain more ones as first digit
than, e.g., nines.


c) The explanation of the paradox 


There are several probabilistic and non-probabilistic approaches to 
Benford's law. Consider first a non-probabilistic one. 

Let us examine the table of the powers of 2. The first digit of 2^n is 1
if there exists an integer s such that 10^s ≤ 2^n < 2·10^s. Each interval
[10^s, 2·10^s) contains exactly one power of 2, so the number of powers of 2
beginning with 1 among the first n is about s. If n (and therefore
s) is large enough then s/n is approximately equal to lg 2, which means
that among the first n powers of 2 a proportion of about lg 2 begins with 1.
Similarly, the proportion of powers of 2 whose first digit is at most k is
about lg (k+1), as in Benford's law.
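
The non-probabilistic argument is easy to check by direct counting. The following sketch tabulates the first digits of 2^1, ..., 2^1000 and compares their relative frequencies with lg (k+1) − lg k. The range of 1000 powers and the function name are arbitrary choices made for this illustration.

```python
import math
from collections import Counter


def leading_digit_frequencies(n_powers=1000):
    """Relative frequencies of the first digits of 2^1, ..., 2^n_powers."""
    counts = Counter(int(str(2 ** n)[0]) for n in range(1, n_powers + 1))
    return {d: counts[d] / n_powers for d in range(1, 10)}


observed = leading_digit_frequencies()
for d in range(1, 10):
    benford = math.log10(d + 1) - math.log10(d)
    print(d, round(observed[d], 3), round(benford, 3))
```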

The probabilistic approach is a bit more complicated. Again we have
to start from the fact that the first nonzero digit of a positive random
number X is at most k if there exists an integer s such that

10^s ≤ X < (k+1)·10^s.


Consequently, Benford’s law holds only if the probability of the
fractional part {lg X} being less than lg (k+1) is exactly lg (k+1). A sufficient
condition is that the fractional part {lg X} is uniformly distributed over
the interval (0, 1). Now the first question is the following: What
conditions on the distribution of X imply that the distribution of {lg X} is
approximately uniform? Secondly, why do the tables very often show
this property? While the first question was discussed by several
mathematicians (e.g., R. S. Pinkham and J. H. B. Kemperman) with fairly
good solutions, the answers to the second one are not satisfactory. They
frequently lead to confused “philosophies”, even to number mysticism.
According to Benford, e.g., while “Man” counts arithmetically: 1, 2, 3, ...,
“Nature” automatically takes the logarithm of numbers and counts
e^1, e^2, e^3, .... Benford states that the data of nature are composed of
geometrical series for which (just like for the powers of 2) Benford’s
law holds. He also mentions several examples in the fields of science and
technology which show the influence of Fechner’s law discovered in the
19th century. According to this law, the relation between stimulus and
sensation is logarithmic. Unfortunately, Benford’s analogies do not give
a satisfactory answer to the question. Further details can be found in
R. A. Raimi’s survey article, where the author gives a very detailed list of
references.


d) Remarks 


(i) If we are asked for the probability p_k that k (k=1, 2, ..., 9) is the
first digit of a random entry from a table of numerical data and we
suppose the existence of a definite solution and its scale invariance
(nothing has been said about the scale units employed) then we arrive at
the log-uniform distribution: p_k = lg (k+1) − lg k.

(ii) Analyzing the second, third, etc. digits, we find that the influence
of “Benford's effect” can hardly be seen, if at all, i.e., the second,
third, etc. digits are approximately uniformly distributed.


8. THE PARADOX OF ZERO PROBABILITY;
(CAN MANY NOTHINGS MAKE SOMETHING?)


a) The history of the paradox 


The probability of an impossible event is zero, but the converse is not
true: the probability of an event may be zero without its being impossible.
For example, the probability that we hit the very centre of the target is
0, though it is not impossible. There is also a zero probability of hitting
any of one thousand fixed points, though this seems more likely than
hitting the centre point. Therefore the question arises of whether it is
possible to compare the “chance” of events with zero probability or not.
The other problem is that the probability of hitting a particular point
of the target is zero, but a marksman will certainly hit one of the points,
so the union of events with probability zero may be an event with
probability one, that is, many nothings can really make something. Is this, in
fact, possible? This paradox is similar to Zeno's two and a half thousand
year old paradox about the impossibility of motion. Zeno said that a
flying arrow is at rest at every instant (or in other words, the displacement
of the arrow during time intervals of zero length must also be zero), so it
is unthinkable that it moves at all. The question is the same: how is it
possible that adding many “nothings” results in “something”? Thus the
essence of our paradox is several thousand years old, but its satisfactory
explanation evolved only in the past decades, due to the research of
Abraham Robinson (1918-1974).
b) The paradox


We choose a point at random in the interval (0, 1). Then the
probability that we have chosen exactly the point 1/2 is zero, just as the
probability that we have chosen any of the points 1/100, 2/100, 3/100, ...,
though the latter seems to be more likely. Is it really impossible to
distinguish between the probabilities of the two events?


c) The explanation of the paradox 


In the history of arithmetic more and more complex types of numbers
have been introduced: natural numbers and fractions were followed by
the zero, the negative, the real (= rational + irrational) and the complex
numbers. In the 1960s the set of numbers was further extended
by introducing infinitely small numbers, the infinitesimals. The word
“infinitesimal” itself had been used since the time of Newton and Leibniz
in differential and integral calculus, but only symbolically, without
a well-defined meaning or foundations. Precisely for this lack of
foundations, infinitesimals were expelled from rigorous mathematics in the past
century, but they did not disappear completely (as physicists used them
continuously). Mathematicians changed over to the use of “epsilon-delta”
analysis, and this still describes the spirit of university education.
Robinson’s theory, however, builds a firm logical foundation “under” the
use of infinitesimals, and the students of the next century will probably
be taught in the revived spirit of Newton's and Leibniz's original
heuristics. (At the University of Wisconsin and at M.I.T., students can, if they
like, choose Robinson's theory instead of the epsilon-delta theory of
Weierstrass.) Infinitesimals can usually be used in calculations similarly
to other numbers. While division by zero is not allowed, division by
a nonzero infinitesimal is well defined: the reciprocal of a nonzero
infinitesimal is an infinitely large number, and conversely, the reciprocal of an
infinitely large number is always an infinitesimal. Before Robinson's
theory we thought that rational and irrational (i.e., real) numbers
entirely fill the number line. Examining one point of the number line under
Robinson's “mathematical microscope”, we see not only one point but
a multitude of numbers which are infinitely near to this point.
This image is called a “monad” out of respect for Leibniz. Many
paradoxes can be solved by the infinitesimals, Zeno's paradox as well as
the paradox of zero probability. The point is that we have to distinguish
between zero and infinitesimally small numbers. It is possible,
for example, to assign a probability to every subset of an interval so that
this probability is zero only for the empty subset corresponding to the
impossible event, and any other event has positive, possibly infinitesimal,
probability. Furthermore, a set A whose probability is P(A) in the
traditional sense will have a new probability which differs from
P(A) by at most an infinitesimal. (This new probability is not
sigma-additive, only additive.) Now we may really say that the probability of
choosing a single point, e.g., the centre of an interval, is smaller than the
probability of choosing one of two points: the difference is an
infinitesimal.


d) Remark 


Newton endeavoured to put the laws of nature into mathematical form, 
thus he arrived at the border between finite and infinite quantities, 
whereas Robinson apprehended infinity itself (having followed the 
example of G. Cantor and others), and made it familiar to everyday 
mathematics. 
9. THE PARADOX OF INFINITELY DIVISIBLE
DISTRIBUTIONS


a) The history of the paradox 


The notion of infinitely divisible distributions was introduced by B. de
Finetti in 1929. A distribution F is called infinitely divisible if for any
positive integer n there exist n independent, identically distributed
random variables such that the distribution function of their sum is just F.
Let F_1 and F_2 be the distribution functions of two independent random
variables, and denote the distribution function of their sum by F_1 * F_2.
The operation * is called convolution. Obviously

$$(F_1 * F_2) * F_3 = F_1 * (F_2 * F_3), \qquad F_1 * F_2 = F_2 * F_1$$

(which means algebraically that the distribution functions with the
convolution as operation form a commutative semigroup). The distribution
function F is infinitely divisible (by the above definition) if for every
natural number n there exists a distribution function F_n such that

$$\underbrace{F_n * F_n * \cdots * F_n}_{n\ \text{times}} = F.$$

Among others, the normal, Poisson, and exponential distributions are
infinitely divisible. The most important role of these distributions is
that they appear as limit distributions of the sum of independent
random variables. In 1936 Cramér proved that if the convolution
of two distributions is normal then both distributions must be normal
and thus infinitely divisible. Two years later the same result was
obtained by Raikov for Poisson distributions. These results are surprising
since they claim that normal distributions can only be decomposed into
normal ones and the same holds for Poisson distributions. What is
even more surprising is that infinitely divisible distributions can be
decomposed into components which are not infinitely divisible.


b) The paradox 


There exist distribution functions which are not infinitely divisible, but 
their convolution is infinitely divisible. 
c) The explanation of the paradox


We will show that the exponential distribution can be decomposed into
the convolution of distributions which are not infinitely divisible.
Consider the exponential distribution with parameter λ=1; its density
function is e^{−x} if x>0 and 0 if x<0. This distribution is indeed infinitely
divisible since the nth convolution power of the gamma-distribution of
order 1/n (whose density function is

$$\frac{x^{1/n-1}e^{-x}}{\Gamma(1/n)} \quad \text{if } x>0,$$

and 0 if x<0) is just our exponential distribution. However, it can be
decomposed not only into gamma-distributions (which themselves are
infinitely divisible, too), but also into the convolution of two distributions,
one of which takes only the values k=0, 1, 2, ... with probability
(1−e^{−1})e^{−k}, and the other is concentrated on the interval (0, 1), i.e.,
it takes values from (0, 1). The latter distribution is not infinitely
divisible: according to Remark (i), no distribution of a bounded,
non-degenerate random variable can be infinitely divisible. So we know
already that an infinitely divisible distribution may have convolution
components which are not infinitely divisible. Now both
components of the exponential distribution obtained above can be
decomposed further so that each component is concentrated on two points,
more precisely, on 0 and an integer power of 2. These distributions (as
every distribution concentrated on two points) are not only not infinitely
divisible but, quite the contrary, irreducible (i.e., there is no way
to decompose them unless one of the components is degenerate, i.e.,
concentrated on a single number).
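
The first decomposition can be verified directly. The sketch below assumes that the two components are the integer part and the fractional part of an exponential variable with parameter 1: the integer part takes the value k with probability (1−e^{−1})e^{−k}, and the fractional part has density e^{−x}/(1−e^{−1}) on (0, 1). The function names are introduced here for illustration only.

```python
import math


def integer_part_prob(k):
    """P(integer component = k) for k = 0, 1, 2, ..."""
    return (1.0 - math.exp(-1.0)) * math.exp(-k)


def fractional_part_density(x):
    """Density of the component concentrated on (0, 1)."""
    if 0.0 < x < 1.0:
        return math.exp(-x) / (1.0 - math.exp(-1.0))
    return 0.0


def density_of_sum(t):
    """Density at t of the sum of the two independent components."""
    return sum(integer_part_prob(k) * fractional_part_density(t - k)
               for k in range(int(t) + 2))


for t in (0.25, 1.5, 3.75):
    print(t, density_of_sum(t), math.exp(-t))
```

For every non-integer t only the term with k equal to the integer part of t contributes, and the normalizing factors cancel, leaving e^{−t}.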


d) Remarks 


(i) We will show that the distribution function of a bounded random
variable cannot be infinitely divisible unless the random variable is
degenerate (i.e., it takes only a single value with probability 1; in
this case its variance is 0). If the random variable X is bounded then there
exists a number K such that |X|<K. If the distribution function of X
is infinitely divisible, then for every n there are independent random
variables X_1, X_2, ..., X_n with the same distribution such that the
distribution functions of X_1+X_2+...+X_n and X are identical. As the
supremum and the variance of the sum of independent random variables are
the sums of the suprema and variances of the components, we have

$$|X_i| \le K/n \quad\text{and}\quad D(X_i) = D(X)/\sqrt{n},$$

where D denotes the standard deviation. Consequently, if D(X)≠0 and n is
large enough then the standard deviation of X_i would be greater than the
supremum of |X_i|, which is impossible.
Therefore, if X is bounded and infinitely divisible then D(X)=0, i.e.,
X is degenerate.

(ii) Besides the normal, Poisson, and gamma-distributions, the log- 
normal distribution is also infinitely divisible. (It is the distribution of a 
positive random variable the logarithm of which is normally distributed — 
see Thorin's paper). Student's t-distribution and the Cauchy-distribution 
(the distribution of the quotient of two independent standard normally 
distributed random variables) are also infinitely divisible (see Lukacs's 
book and the papers by Grosswald, Epstein and Bondesson). 

(iii) The exponential distribution serves as an example for infinitely 
divisible distributions decomposable into a (countably) infinite convolu- 
tion of irreducible distributions. It is even more surprising that there 
exist infinitely divisible distributions which can be decomposed into the 
convolution of only two irreducible distributions (see Lévy's paper). 

(iv) Infinitely divisible distributions were characterized by Kolmogorov,
Lévy and Hinčin in the 1930s. It is easy to show that the distribution
function of X_1+X_2+...+X_N is always infinitely divisible if X_1, X_2, ...
are arbitrary nonnegative integer valued, independent, identically
distributed random variables and N is a Poisson distributed random variable
independent of all X’s. At the same time it follows from Lévy's and
Hinčin’s theorem that every infinitely divisible distribution concentrated on
the nonnegative integers must be of this kind. Despite the fact that
characterization theorems of infinitely divisible distributions are already
50 years old, the problem of characterizing the infinitely divisible
distributions having only infinitely divisible convolution components (the
normal and the Poisson distributions belong to this class but, as we have
seen, the exponential distribution does not) is even now an unsolved
problem.

(v) In probability theory the notion of infinite divisibility comes up
not only in connection with convolution. Other important operations
can also be defined between distribution functions. For example, the
distribution function of the maximum of two independent random
variables with distribution functions F_1(x) and F_2(x) is F_1(x)F_2(x).
Therefore the product of distribution functions appears in many
probabilistic problems, e.g., in reliability theory when we want to obtain
the probability distribution of life times in shunt connection. Obviously,
for any natural number n and any distribution function F(x), F(x)^{1/n}
is also a distribution function, thus every (one-dimensional) distribution
function is infinitely divisible with respect to this operation. In the case
of more than one dimension, the characterization of infinitely divisible
distributions is less trivial (see the paper by Balkema and Resnick). A third
operation is the following.

Let F_1 ∘ F_2 denote the multiplicative convolution, i.e., the distribution
function of the product of independent random variables with
distribution functions F_1 and F_2. While the Poisson distribution is infinitely
divisible if the operation is the convolution *, it is not divisible if the
operation is ∘. Moreover, if X and Y denote independent random variables
and XY is of Poisson distribution, then either X or Y is concentrated on
the two-element set {0, 1} with probability 1. This means that the Poisson
distribution is ∘-irreducible (see the paper by Székely and Zempléni).
At the same time, the standard normal distribution is ∘-infinitely
divisible, too. (The ∘-infinite divisibility of normal distributions with positive
expectation is not yet proved or disproved; if the expectation is negative,
it is obviously not ∘-infinitely divisible.)


e) References 


Balkema, A. A., Resnick, S. I., “Max-infinite divisibility”, J. Appl. Prob., 14,
309-319, (1977).

Bondesson, L., “A general result on infinite divisibility”, Annals of Probability, 7,
965-979, (1979).

Epstein, B., “Infinite divisibility of Student's t-distribution”, Sankhya, Ser. B., 39,
103-120, (1977).

Fisz, M., “Infinitely divisible distributions: recent results and applications”, Annals
of Math. Statist., 33, 68-84, (1962).

Göndőcs, F., Michaletzky, G., Móri, T., Székely, G. J., “A characterization of infinitely
divisible Markov chains with finite state space”, Ann. Univ. Sci. Budapest Sect.
Math., 27, 137-141, (1985).

Grosswald, E., “The Student t-distribution of any degree of freedom is infinitely
divisible”, Z. Wahrscheinlichkeitstheorie verw. Geb., 36, 103-109, (1976).

Lévy, P., “Sur les exponentielles de polynômes”, Ann. Sci. École Normale Supérieure,
54, 231-292, (1937).

Lukacs, E., Characteristic Functions, Griffin, London, 1960.

Steutel, F. W., “Infinite divisibility in theory and practice”, Scand. J. Statist., 6,
57-64, (1979).

Székely, G. J., “Multiplicative infinite divisibility of the normal distribution”, Proc.
7th Brasov Conf. on Probab. Theory, Acad. Publ., Bucuresti, 579-582, 1984.

Székely, G. J., Zempléni, A., “Advanced problem 6431”, The American Math. Monthly,
90, 402, (1983).

Székely, G. J., “Problem 180”, Statistica Neerlandica, 39, 324, (1985).

Thorin, O., “On the infinite divisibility of the lognormal distribution”, Scand. Actuarial
J., 121-148, (1977).

Zolotarev, V. M., “On a general theory of multiplication of distributions of
independent random variables”, Dokl. Acad. Sci. USSR, 132, 388-389, (1962), (in
Russian).


10. PARADOXES OF CHARACTERIZATION 
a) The history of the paradox 


The originator of the following problem is again George Polya. Consider
two independent identically distributed random variables X and Y. Is it
possible that aX+bY has the same distribution as X and Y if a and b
are positive numbers? Polya analyzed this question in his paper published
in 1923. The next remarkable result appeared only after a long interval,
in 1936, when R. C. Geary began to describe the distributions F that
had the following property: if the variables X_1, X_2, ..., X_n are
independent and follow the distribution F, then

$$\bar{X} = \frac{X_1+X_2+\cdots+X_n}{n} \quad\text{and}\quad S = \frac{1}{n}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2$$

are also independent. M. Kac in 1939 and S. N. Bernstein in 1941 answered
the question: if X and Y are independent and identically distributed
variables, then under what conditions are the variables X+Y and X−Y
independent? Since the forties, characterization has evolved into a very 
important part of probability theory, both theoretically and practically, 
after the work of such outstanding mathematicians as Yu. V. Linnik, 
E. Lukacs, A. A. Zinger, C. R. Rao and A. M. Kagan. 


b) The paradoxes 


Let X_1, X_2, ..., X_n be independent, identically distributed random
variables. Is it possible that Y = f(X_1, X_2, ..., X_n) and
Z = g(X_1, X_2, ..., X_n) are identically distributed, or independent, if f
and g are different (e.g., linear) functions? In certain cases, for example,
if f (and consequently Y) is identically constant, then Y and Z are
obviously independent no matter what the function g is, but “generally” we
would expect that Y and Z are neither independent nor identically
distributed. Surprisingly, however, exceptions turn up precisely in the most
important cases, when the X_i's are normally distributed. If, for example,
X_1 and X_2 represent the coordinates of the velocity vector of a point
moving randomly in a plane, and X_1, X_2 are independent standard normal
variables, then the quantities Y = X_1^2 + X_2^2 (which is proportional to
the kinetic energy) and Z = X_1/X_2 (which determines the direction of the
motion) are independent. These kinds of properties often characterize
normal distributions (or other important distributions).
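
The independence of the energy and the direction can be illustrated by simulation. The sketch below compares the joint frequency of the events {Y ≤ a} and {Z ≤ b} with the product of their separate frequencies. The thresholds a = 1 and b = 0.5, the sample size and the function name are arbitrary choices for this illustration, and agreement for one pair of events is of course only a necessary sign of independence, not a proof.

```python
import random


def joint_vs_product(n_samples=100_000, a=1.0, b=0.5, seed=1):
    """Return (P(Y<=a, Z<=b), P(Y<=a) * P(Z<=b)) estimated from samples."""
    rng = random.Random(seed)
    count_y = count_z = count_both = 0
    for _ in range(n_samples):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y_small = x1 * x1 + x2 * x2 <= a       # Y = X_1^2 + X_2^2
        z_small = x1 / x2 <= b                 # Z = X_1 / X_2
        count_y += y_small
        count_z += z_small
        count_both += y_small and z_small
    p_y = count_y / n_samples
    p_z = count_z / n_samples
    return count_both / n_samples, p_y * p_z


p_joint, p_product = joint_vs_product()
print(p_joint, p_product)
```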


c) The explanation of the paradoxes 


Let both f and g be linear functions:

$$Y = \sum_{i=1}^{n} a_i X_i, \qquad Z = \sum_{i=1}^{n} b_i X_i.$$

If there exist numbers a_i and b_i such that a_i b_i is not always zero, and
a_i = b_i does not hold for every i, but Y and Z are still identically
distributed and all the moments of X_i are finite, that is, E(|X_i|^k) is
finite for every k (k=0, 1, 2, ...), then the X_i's are normally distributed.
This theorem of J. Marcinkiewicz is a generalization of Polya's theorem.
(For further generalization we refer to the book by Kagan, Linnik and Rao.)
The theorem of G. Darmois and V. P. Skitovič states that if

$$\sum_{i=1}^{n} a_i X_i \quad\text{and}\quad \sum_{i=1}^{n} b_i X_i$$

are independent and a_i b_i does not equal zero for any i, then the X_i's
are normally distributed.

The following generalization of Geary’s theorem (also involving
nonlinear functions) is very important in mathematical statistics. It states
that if X̄ and S are independent, and n≥2, then the X_i's are normally
distributed.


d) Remarks 


(i) While the independence of X̄ and S is a very strong condition, their
correlation coefficient r(X̄, S) = 0, e.g., for all symmetric random
variables X_1, X_2, ..., X_n when the correlation exists. Though r(X̄, S) is
always less than 1, its supremum is 1. For unimodal distributions the
sharp upper bound is √(15/16). In proving this result one can apply the
sharp inequality m_3^2 ≤ (m_4 − m_2^2) m_2, where m_k = E(X − E(X))^k,
k = 2, 3, 4.

(ii) There are several interesting and natural ways of characterizing a
family of distributions. For example, exponential distributions can be
characterized by the following property: the entropy

$$-\int f(x)\log f(x)\,dx$$

[f(x) denotes the density function] is maximal for the exponential
distributions among all distributions on the interval (0, ∞) which have a given
expectation. Among distributions on the interval (−∞, ∞) with given
expectation and variance, normal distributions have maximal entropy.
On a finite interval the entropy is maximized by uniform distributions
(without any further assumption).
11. PARADOXES OF FACTORIZATION
a) The history of the paradox


The basic theorems in classical probability theory (such as the laws of
large numbers, theorems on limit distribution) concern the distribution
of the sum of independent random variables on the basis of the
properties of its terms. The “converse” of these “composition” theorems are
the “decomposition” or “factorization” theorems where the distribution
of the sum is known and we want to gain some information on the
possible terms or “factors”. Such a decomposition result is Cramér's theorem,
which has already been mentioned. It claims that all factors of a normal
distribution are also normal. Both in composition and decomposition
theorems the characteristic function of random variables plays an
important “technical role”. The characteristic function of a random
variable X is defined as the expectation of the complex random variable
e^{itX} (i² = −1 and t is a real number), i.e., φ_X(t) = E(e^{itX}). Every random
variable has a characteristic function, which uniquely determines the
distribution function of the variable. The characteristic function of the
sum of independent random variables is the product of the characteristic
functions of the terms. These properties make it clear why characteristic
functions are so extremely important for the solution of composition
and factorization problems. They were already used by A. Cauchy in
1853 and A. M. Lyapunov at the turn of the century. Since the 1920s,
mainly due to the work of G. Polya and P. Lévy, characteristic functions
have been used very frequently in solving composition problems. Since
the 1930s, due to the theorems of Cramér, Hinčin and Raikov, the theory
of decomposition has also evolved. There is no lack of surprising results
or paradoxes in this field either (see the ones below).


b) The paradoxes 


(i) There exist random variables X, Y and Z such that the probability
distribution of X+Y is equal to that of X+Z but the distributions of Y
and Z are not the same. This fact, first pointed out by Hinčin in 1937,
is rather surprising because if X is either a bounded random variable or
its characteristic function is never equal to 0 (e.g., if it is infinitely
divisible) then the distributions of Y and Z must also be equal. Owing to
Hinčin's paradox, in general it makes no sense to speak about the “rest”
of a probability distribution after one of its factors has been cancelled,
because the remaining part is not unique. A great many difficulties are
caused by this fact in the algebra of probability distributions. At the
same time it is reasonable to ask what remains if a normal factor of a
distribution is omitted, since the characteristic function of a normal
distribution is never equal to 0 (the characteristic function of the
standard normal distribution is $e^{-t^2/2}$). A certain degree of caution,
however, is required even in this case. Namely, there exist independent,
identically distributed random variables X and Y that have no normally
distributed factors (with positive variance), yet the random variable X+Y
does have one. This result was first pointed out by D. Dugué and
R. A. Fisher in 1948 (see their paper below).

(ii) Let $X_1, X_2, \ldots, X_n$ be independent random variables whose
distributions (not necessarily the same) are not known. We know, however,
the distributions of the linear combinations


$$Y_j = \sum_{k=1}^{n} c_{jk} X_k, \qquad j = 1, 2, \ldots, n$$


(the $c_{jk}$ are arbitrary numbers). If there exist $X_1, X_2, \ldots, X_n$
satisfying this system of equations and the determinant of the matrix
$(c_{jk})$ is not 0, then (since in this case $Y_1, Y_2, \ldots, Y_n$
determine $X_1, X_2, \ldots, X_n$ uniquely) we might think that the
distributions of $X_1, X_2, \ldots, X_n$ are also uniquely determined.
As A. Rényi showed in 1950, this is not the case.


c) The explanation of the paradoxes


(i) It can be shown that if the value of a function $\varphi(t)$ is
nonnegative for any real t, $\varphi(0) = 1$, $\varphi(-t) = \varphi(t)$,
and for positive arguments t the function $\varphi(t)$ is convex and
$\lim \varphi(t) = 0$ as $t \to \infty$, then there exists a random
variable whose characteristic function is $\varphi(t)$. Thus there is a
random variable X with characteristic function $1 - |t|$ if $|t| \le 1$
and 0 otherwise, and there also exist random variables Y and Z whose
characteristic functions are the same on the interval $|t| \le 1$ but
differ outside this interval. Therefore if X, Y and Z are independent then


$$\varphi_{X+Y}(t) = \varphi_{X+Z}(t),$$


i.e., X+Y and X+Z have the same distribution, while the distributions
of Y and Z are different. (We shall see another example in 13a.)

Slightly more can be proved than stated in the paradox. It can be shown
that when $\psi(t)$ is a periodic function with period 2 and
$\psi(t) = 1 - |t|$ whenever $|t| \le 1$, then there exists a random
variable Y whose characteristic function is just $\psi(t)$. If
$\varphi(t)$ denotes the characteristic function that equals $1 - |t|$ for
$|t| \le 1$ and 0 otherwise, it follows that
$\varphi(t)\psi(t) = \varphi(t)^2$, i.e., there are independent random
variables X, Y and Z such that the distribution of X+Y is equal to that
of X+Z, and the distributions of X and Z are the same while those of Y
and Z differ.

(ii) If the random variables $Y_j$ are given and the determinant
$\|c_{jk}\| \ne 0$, then the random variables $X_k$ are uniquely
determined. The variables $Y_j$ can, however, be given in several different
ways so that their distributions remain unchanged. Therefore it is not at
all certain that the distributions of the $Y_j$ uniquely determine the
distributions of the $X_k$ if only $\|c_{jk}\| \ne 0$ is supposed.


d) Remarks 


Rényi proved that generally it is also necessary to suppose that
$\|c_{jk}^2\|$ is not equal to 0. If we know that


$$\|c_{jk}\| \ne 0 \quad \text{and} \quad \|c_{jk}^2\| \ne 0,$$


then under general conditions the uniqueness is also guaranteed (e.g.,
if the characteristic function of $Y_j$ is an entire function of order ≤ 2
and if there is a solution at all). This fact has very important practical
consequences. We have seen in II. 13/c that two values can be obtained
by two measurements more accurately (the variances are smaller) if they are
not measured one by one but if first their sum, then their difference is
measured. The case is similar if we want to know n different unknown
quantities $X_1, X_2, \ldots, X_n$. Greater accuracy can be achieved if
certain linear combinations $Y_1, Y_2, \ldots, Y_n$ are measured. Generally
it is reasonable to use a matrix $(c_{jk})$ which consists of +1 and −1
only. In this case obviously $\|c_{jk}^2\| = 0$, and therefore the
distributions of $Y_1, Y_2, \ldots, Y_n$ do not uniquely determine the
distributions of $X_1, X_2, \ldots, X_n$. As an example, let the
distribution of both $Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$ be standard
normal. Then Cramér's theorem states that $X_1$ and $X_2$ are also normally
distributed with expected value 0. However, their variances are not
determined uniquely. They merely satisfy the relation
$D^2(X_1) + D^2(X_2) = 1$.


12. THE PARADOX OF IRREDUCIBLE
AND PRIME DISTRIBUTIONS


a) The history of the paradox 


Irreducible numbers, i.e., integers (greater than one) that have only one 
and themselves for divisors, play a fundamental part in arithmetic. 
These numbers: 2, 3, 5, 7, 11, 13, 17, 19, ... are also prime numbers, i.e., 
if they are factors of the product of two natural numbers, then they are 
also factors of at least one of these numbers. Among natural numbers 
primes and irreducibles are the same and the numbers 2, 3, 5, 7, ... are
always called prime numbers. The fundamental theorem of arithmetic
states that there is exactly one prime factorization of each integer greater 
than one (the order of the prime factors is disregarded). Thus prime num- 
bers in arithmetic are like building blocks or like atoms in the physical 
world. The most natural way of getting information about a complicat- 
ed structure is to break it down into atoms, hence it is understandable 
that the notions of irreducibility and primality have been extended to 
general algebraic structures (since the last century). These notions can 
also be interpreted for probability distributions: the role of natural num- 
bers is taken over by probability distributions and the role of multiplica- 
tion by convolution (for the definition of convolution see “The paradox 
of infinitely divisible distributions"). 

A distribution F is irreducible if $F = G * H$ implies that one of G and
H is degenerate (i.e., concentrated at a single point with probability
one; these distributions play the role of units). A distribution F is called
a prime distribution if it is a factor of $G * H$ only if it is also a
factor of G or H. Hinčin proved in 1937 that every distribution is the
convolution of an infinitely divisible distribution and a finite or
countable convolution product of irreducible distributions, i.e., every
characteristic function $\varphi(t)$ can be expressed in the following form:


$$\varphi(t) = \psi(t) \prod_i \varphi_i(t),$$


where $\psi(t)$ and $\varphi_i(t)$ are the characteristic functions of
infinitely divisible and irreducible distributions, respectively. This is
somewhat similar to the fundamental theorem of arithmetic but there is an
important difference: the factorization of distributions is not unique. If,
for example, the distribution F assumes the values 0, 1, 2, 3, 4, 5 with the
same probability 1/6, then F can be decomposed into the convolution of
irreducible distributions in two different ways: in the first decomposition
the first factor assumes the values 0 and 1, and the second factor assumes
the values 0, 2 and 4, with equal probabilities; in the second decomposition
the first factor assumes the values 0 and 3, and the second factor assumes
the values 0, 1 and 2, with equal probabilities. This ambiguity shows that
the analogy between the “arithmetic” of numbers and probability
distributions is not perfect. The following paradox will show a more
considerable discrepancy.
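
The two decompositions of the uniform distribution on 0, 1, ..., 5 can be
checked by direct convolution. The following is only a small sketch with
exact rational arithmetic; the helper names `convolve` and `uniform` are
chosen here for illustration and do not come from the text.

```python
from collections import defaultdict
from fractions import Fraction


def convolve(p, q):
    """Distribution of the sum of two independent variables.

    Each distribution is a dict mapping a value to its probability.
    """
    out = defaultdict(Fraction)
    for x, px in p.items():
        for y, py in q.items():
            out[x + y] += px * py
    return dict(out)


def uniform(values):
    """Uniform distribution on a finite collection of values."""
    values = list(values)
    return {v: Fraction(1, len(values)) for v in values}


target = uniform(range(6))                                # values 0..5, each 1/6
first = convolve(uniform([0, 1]), uniform([0, 2, 4]))     # first decomposition
second = convolve(uniform([0, 3]), uniform([0, 1, 2]))    # second decomposition

print(first == target, second == target)
```

Both comparisons hold, so the same distribution arises from two different
pairs of factors.
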


b) The paradox


Probability distributions with the operation of convolution form an 
algebraic structure that contains many irreducible distributions (e.g., 
every distribution concentrated on two points is irreducible), but it 
does not contain any prime distribution. So if we really regard primes as 
the “atoms” of probability distributions, then there are no atoms at all. 
I. Z. Ruzsa and G. J. Székely first pointed out this fact in 1979.


c) The explanation of the paradox 


The fact that primes and irreducibles are usually different is not
surprising at all; it seems unexpected only because in the most familiar
and important structure, the natural numbers, the two notions are
equivalent. In general, however, we can only say that a prime is always
irreducible but the converse is not necessarily true. The coincidence of
the two terms means (roughly speaking) that the factorization into
irreducible elements is unique. It was observed already by Hinčin in 1937
that the convolution decomposition of probability distributions into
irreducibles is not unique, that is, not every irreducible is prime. The
article of Ruzsa and Székely shows that there are no primes in this
structure at all, and thus concerning the connection between the notions of
irreducibility and primality, the multiplicative structure of natural
numbers and the convolution structure of probability distributions are
opposite extremes.


d) Remark 


Say that two distributions F and G are relatively prime if F and G can be
the factors of a distribution H only if $F * G$ is also a factor of H. In
the article already referred to, we proved that bounded (and nondegenerate)
distributions cannot be relatively prime, and we conjecture that there are
no relatively prime distributions at all; but this problem is still unsolved.


13. QUICKIES


a) The paradox of halving distributions 


Let X and Y be independent, identically distributed random variables.
The distribution of X+Y usually determines uniquely the common
distribution of X and Y, but, paradoxically, not always. This fact is
surprising, because the distributions met in practice can usually be
divided uniquely, i.e., their nth part is determined uniquely (if it
exists), as in the case of bounded or infinitely divisible distributions.
Now let us see a paradoxical example.

If a random variable is restricted to the values 2k+1 (k = 0, ±1,
±2, ...), and the corresponding probabilities are


$$\frac{4}{\pi^2 (2k+1)^2},$$


then its characteristic function $\varphi(t)$ is periodic with period
$2\pi$, and in the interval $-\pi \le t \le \pi$


$$\varphi(t) = 1 - \frac{2|t|}{\pi}.$$


We now define another random variable which assumes zero with probability
1/2 and assumes 4k+2 (k = 0, ±1, ±2, ...) with probability


$$\frac{2}{\pi^2 (2k+1)^2}.$$


The characteristic function $\psi(t)$ of this random variable is also
periodic. The length of its period is $\pi$, and in the interval
$-\pi/2 \le t \le \pi/2$


$$\psi(t) = 1 - \frac{2|t|}{\pi}.$$


Obviously $\psi(t) = |\varphi(t)|$, so $\psi(t)^2 = \varphi(t)^2$. Thus if
the characteristic function of X+Y is $\psi(t)^2 = \varphi(t)^2$, then the
common characteristic function of X and Y may be either $\varphi(t)$ or
$\psi(t)$, so it is not uniquely determined. We note that
$(\psi(t) + \varphi(t))/2$ is also a characteristic function, thus


$$\frac{\varphi(t) + \psi(t)}{2}\, \varphi(t) = \frac{\varphi(t) + \psi(t)}{2}\, \psi(t),$$


which gives another example of the first factorization paradox, since
this equation cannot be reduced by $(\varphi(t) + \psi(t))/2$.
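
The relation between the two characteristic functions can be checked
numerically. The sketch below assumes the probabilities
$4/(\pi^2(2k+1)^2)$ on the odd integers 2k+1 and, for the second variable,
1/2 at zero and $2/(\pi^2(2k+1)^2)$ at 4k+2; it sums truncated cosine
series, so the agreement is only approximate. The function names and the
truncation length are illustrative choices.

```python
import math


def phi(t, terms=20000):
    """Truncated series for the characteristic function of the first variable.

    The values 2k+1 and -(2k+1) have equal probability, so their
    contributions combine into a cosine with a doubled weight.
    """
    return sum(
        8.0 / (math.pi ** 2 * (2 * k + 1) ** 2) * math.cos((2 * k + 1) * t)
        for k in range(terms)
    )


def psi(t, terms=20000):
    """Truncated series for the second variable (mass 1/2 at zero)."""
    return 0.5 + sum(
        4.0 / (math.pi ** 2 * (2 * k + 1) ** 2) * math.cos((4 * k + 2) * t)
        for k in range(terms)
    )


for t in (0.3, 2.0, 4.0):
    # psi(t) should agree with |phi(t)|, hence psi(t)**2 with phi(t)**2
    print(t, round(phi(t), 4), round(psi(t), 4))
```

At t = 2, for instance, the first function is negative while the second is
positive, although their squares coincide.
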

One can also construct characteristic functions so that they do not
always assume real values, but their square is always real. Consequently
there exist probability distributions which are symmetric about the
origin, but “half of them” are not symmetric, in the sense that there
exist independent, identically distributed random variables X and Y such
that X+Y is symmetrically distributed, but X and Y are not. (In the case
of bounded variables this cannot happen.) It also seems surprising that if
the random variable X has a symmetric density function f(x)
(f(−x) = f(x)) such that 0 < a < f(x) < b < ∞ whenever |x| ≤ c < ∞ and
f(x) = 0 for |x| > c, then X has no half, i.e., there is no characteristic
function $\varphi$ such that $\varphi_X = \varphi^2$. (Ref.: Problem 10,
Mat. Lapok, 30, 1982, p. 272 (in Hungarian). Proposed by
T. F. Móri and G. J. Székely.)


b) Pathological probability distributions


(i) Let the distribution function F of a random variable have the
following properties: F(0) = 0, F(1) = 1 and for


$$x = \sum_{r=1}^{\infty} 2^{-a_r} \qquad (a_1 < a_2 < a_3 < \ldots \text{ positive integers})$$


$$F(x) = \sum_{r=1}^{\infty} a^{r-1} (1+a)^{-a_r},$$


where a is an arbitrary positive number. L. Takács showed (The American
Math. Monthly, 85, 35-37, 1978) that F(x) is a strictly monotone
increasing and continuous function on the interval (0, 1). For a = 1,
F(x) = x on the interval (0, 1), that is, the random variable is uniformly
distributed: its density function is zero outside the interval (0, 1),
where it is one. Surprisingly, if a ≠ 1, then F (and the corresponding
random variable) never has a density function, that is, there does not
exist a function f for which


$$F(x) = \int_{-\infty}^{x} f(u)\, du.$$


Though the most frequently used continuous probability distributions always
have density functions, we must not forget about the pathological random
variables we have just mentioned. It is interesting, for example,
that if a uniformly distributed random variable is decomposed into the
sum of two independent random variables whose distribution functions
are continuous, then at least one of them is pathological, i.e., it does not
have a density function. (The history of pathological and very
pathological, i.e., “singular” functions began in 1904, when H. Lebesgue
published his book on the integration of functions. One of the latest
results on singular functions is due to T. Zamfirescu, “Most monotone
functions are singular”, The American Math. Monthly, 88, 47-49, (1981). See
also F. S. Cater, “Most monotone functions are not singular”, The
American Math. Monthly, 89, 466-469, (1982).)
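
The distribution function F of item (i) can be evaluated from the binary
digits of x. The sketch below assumes the reading
$F(x) = \sum_r a^{r-1}(1+a)^{-a_r}$, where $a_1 < a_2 < \ldots$ are the
positions of the 1 digits in the binary expansion of x; the function name
`takacs_F` and the digit limit are hypothetical choices made for this
illustration.

```python
def takacs_F(x, a, max_bits=60):
    """Evaluate F(x) = sum_r a**(r-1) * (1+a)**(-a_r) for 0 < x < 1.

    a_1 < a_2 < ... are the positions of the 1 bits of x in base 2.
    """
    total = 0.0
    ones_seen = 0                      # equals r - 1 for the next 1 bit
    for position in range(1, max_bits + 1):
        x *= 2
        if x >= 1:                     # the bit at this position is 1
            x -= 1
            total += a ** ones_seen * (1 + a) ** (-position)
            ones_seen += 1
        if x == 0:                     # finite expansion exhausted
            break
    return total


print(takacs_F(0.375, 1))              # a = 1 gives back x itself
print([takacs_F(x, 2) for x in (0.25, 0.375, 0.5, 0.75)])
```

For a = 1 the function reproduces x, while for a = 2 the values still
increase with x but no longer follow the uniform distribution.
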

(ii) Let the joint density function of X and Y be


$$h(x, y) = \frac{|x|}{2\sqrt{2\pi}}\, e^{-|x| - x^2 y^2/2}.$$


Then the density function of X is


$$f(x) = \begin{cases} 0, & \text{if } x = 0, \\ \int_{-\infty}^{+\infty} h(x, y)\, dy = \tfrac{1}{2} e^{-|x|}, & \text{if } x \ne 0. \end{cases}$$

Clearly h is continuous, but f is not continuous at zero. L. E. Clarke
constructed an example (The American Math. Monthly, 82, 845-846, (1975))
where h is continuous everywhere, but f is nowhere continuous! It can
easily be shown that f is always lower semi-continuous, i.e.,


$$\liminf_{u \to x} f(u) \ge f(x).$$


Moreover, if the integral of a non-negative, lower semi-continuous function
f extended over the entire x axis is unity, then there exists a continuous
density function h(x, y) such that


$$f(x) = \int_{-\infty}^{+\infty} h(x, y)\, dy.$$


(Ref.: Pelling, M. J., Verbeek, A., “On marginal density functions of continuous
densities II”, The American Math. Monthly, 84, 364-365, (1977).)


c) The newsagent paradox 


A newsagent orders N dailies every day. He makes a profit of b dollars
on every daily sold and has a loss of c dollars on every daily left over.
Which N should he choose to maximize his expected profit? The number
of customers naturally depends on chance. Suppose it follows the Poisson
distribution with some parameter λ, that is, the probability that the
number of customers is exactly n is $\lambda^n e^{-\lambda}/n!$. If we put
b = 1, c = 2 and λ = 10, i.e., the average number of customers is 10, then
one can show that the number of dailies that should be ordered is 9. It is
evident, however, that if the newsagent orders only 9 dailies every day,
the average number of customers will decrease from 10 to 9; but in this
case the optimal number of dailies would only be 8, etc. The explanation of
this paradoxical situation lies in the fact that we must take into account
the loss caused by losing a “potential customer”, who leaves disappointed.
Let d dollars be the loss of the newsagent if he cannot serve a customer
with a daily. (The value of d cannot be determined as simply as the
values of b and c, hence, unfortunately, it is often neglected. For
example, if d = 1 dollar, then for λ = 10 the optimal value of N is 10.)

In general denote by X the (random) number of customers (we no longer
suppose that X is Poissonian). One can show that the optimal N is the
solution of the equation $P(X < N) = \frac{b + d}{b + c + d}$ (for an
integer-valued X it can usually be satisfied only approximately), i.e., if
the newsagent stocks N copies of a daily paper where N is the solution of
this equation, then his expected profit will be maximized.
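
The two numerical claims for Poisson demand can be reproduced by brute
force. The sketch below assumes the profit model described above: b per
copy sold, a loss of c per copy left over and a loss of d per customer
turned away. The function names, the truncation of the Poisson sum at 100
and the search range are choices made here for illustration only.

```python
import math


def poisson_pmf(n, lam):
    """Probability that a Poisson(lam) variable equals n."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


def expected_profit(N, b, c, d, lam, cutoff=100):
    """Expected daily profit when N copies are ordered.

    The sum over the number of customers is truncated at `cutoff`,
    beyond which the Poisson(10) probabilities are negligible.
    """
    total = 0.0
    for n in range(cutoff + 1):
        sold = min(n, N)
        profit = b * sold - c * (N - sold) - d * (n - sold)
        total += profit * poisson_pmf(n, lam)
    return total


def best_order(b, c, d, lam, max_order=50):
    """Order size in 0..max_order with the largest expected profit."""
    return max(range(max_order + 1),
               key=lambda N: expected_profit(N, b, c, d, lam))


print(best_order(1, 2, 0, 10))   # lost customers ignored (d = 0)
print(best_order(1, 2, 1, 10))   # each lost customer costs d = 1
```

With d = 0 the search returns 9, and with d = 1 it returns 10, in
agreement with the values quoted in the text.
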


(Ref.: Morse, P. M., Kimball, G. E., Methods of Operations Research, Wiley, New 
York, 1951. 
DeGroot, M. H., Optimal Statistical Decisions, McGraw Hill, New York, 
1970.) 


d) Kesten's paradox 


According to Kolmogorov's strong law of large numbers, if
$X_1, X_2, X_3, \ldots$ are independent, identically distributed random
variables, then the sequence


$$\bar X_n = \frac{X_1 + X_2 + \ldots + X_n}{n}$$


converges to a constant M with probability one if and only if the
expectations of the $X_i$'s exist; then these expectations are just equal
to M. Thus if M exists, then the sequence $\bar X_n$ is very “regular” with
probability one: its only accumulation point is M. To what extent can the
sequence $\bar X_n$ be “irregular” if M does not exist? Harry Kesten proved
in 1970 that the set of the limit points of the sequence $\bar X_n$ can be
an arbitrary closed set (independent of chance) which contains −∞ and +∞
with probability one. Accordingly, the set of limit points may be the
entire number line, though it has not yet been established what kind of
characteristics the distribution function of the $X_i$'s must have to
possess this property.


(Ref.: Kesten, H., “The limit points of a normalized random walk”, 
Annals of Math Statist., 41, 1173-1205, (1970).)


e) The paradox of the stochastic geyser


Alfred Rényi proposed the following question in 1962. Consider a geyser
which is gushing at intervals $X_1, X_2, \ldots$; suppose these are
independent, identically distributed random variables. We measure the times
of gushes, $S_n = X_1 + X_2 + \ldots + X_n$. How large can the measuring
errors be if we want to determine the unknown distribution of the $X_i$'s
with probability one, on the basis of our measurement data? This question
is in close connection with the following problem. Let
$S_n = X_1 + X_2 + \ldots + X_n$ and $T_n = Y_1 + Y_2 + \ldots + Y_n$ be
the partial sums of independent, identically distributed random variables
(the X's need not be independent of the Y's). How precisely can $S_n$
approach $T_n$ if the distributions of the X's and Y's are different?
According to the strong law of large numbers, if $(S_n - T_n)/n$ tends to
zero, then the expected values of the X's and Y's (provided that they
exist) cannot be different (with probability one). The well-known law of
the iterated logarithm shows that if the standard deviation D(X) of the
X's exists, then


$$\limsup_{n \to \infty} \frac{S_n - E(S_n)}{\sqrt{2 n \ln \ln n}} = D(X)$$


with probability one. Consequently, if even $(S_n - T_n)/\sqrt{n}$
converges to zero, then the standard deviations of the X's and Y's must be
equal. The researches of Skorohod and Strassen, based on the theory of
Brownian motion, led to the conjecture that if, e.g., the X's are bounded
and the Y's are normally distributed, then $|S_n - T_n|$ is at least
$\sqrt[4]{n}$ (if n is large enough). Hence $(S_n - T_n)/\sqrt[4]{n}$
cannot converge to zero (with probability one). Relying upon these findings
it was thought to be enough to measure the times of gushing of the
stochastic geyser with an error smaller than $\sqrt[4]{n}$. It was a great
surprise that, after P. Révész and M. Csörgő had taken the initiative,
J. Komlós, P. Major and G. Tusnády showed in 1974 that $S_n$ can be
approximated by $T_n$ so well that even $(S_n - T_n)/\ln n$ remains
bounded. Thus if we record the times of gushes, the measurement errors must
be kept within the limit of ln n. P. Bártfai showed that if the measurement
error divided by ln n converges to zero, then the distribution of intervals
between subsequent gushes can be determined with probability one.


(Ref.: M. Csörgő, P. Révész, Strong Approximations in Probability and Statistics, 
Akadémiai Kiadó, Budapest, 1981.) 


f) The paradox of probability in quantum physics 


The methods of probability theory were widely used in physics even in 
the last century. Classical statistical physics started from the idea that 
the equilibrium of a system (consisting of a large number of particles)
is the most probable state of the system. The methods of statistical phys- 
ics were thought to describe only approximately the macroscopic 
behaviour of a system. Through the probabilistic interpretation of 
quantum physics, however, chance and probability became a fundamental
part of physics as a whole. Probability has become a basic notion, like
energy, for example, and not merely some kind of approximation which could
be avoided in principle. Even Einstein was not pleased with this sweeping
change in the foundations of physics, though he was not a bit conservative.
He wrote in a letter to Max Born (who was awarded the Nobel Prize for his
probabilistic interpretation of the quantum mechanical wave function) that
he did believe in the existence of perfect laws of Nature: “God does not
play dice.” In his answer, Born explained that
instead of solving a great number of differential equations, in some cases 
one can obtain reasonable results by tossing dice. Since that time Born's 
conception has become dominant. Chance and probability are already 
accepted notions of physics. These changes also affected philosophy: 
mechanistic determinism lost its dominant importance. The present 
state of the world does not determine uniquely its future state. With our 
present knowledge we can determine only the probability of future 
events. This, however, does not mean agnosticism, since the laws of 
chance are recognizable (probability theory deals with exactly this). 
Paradoxically, the physical concept of probability is not simply the
application of mathematical probability in physics. The motives and spirit
of the two concepts are different. According to R. P. Feynman, who won
the Nobel Prize for physics in 1965, the laws of quantum physics can be
understood on the basis of probability theory that evolved from the
theory of games of chance if we apply the laws of probability theory in
the case of large numbers of particles, but these laws do not explain the 
behaviour of a single electron or proton. The wave theory of de Broglie 
and Schrödinger and the uncertainty principle of Heisenberg led to the 
elaboration of a new quantum-probability theory between 1926 and 
1929, especially due to Born. Kolmogorov's mathematical theory of prob- 
ability was also built up about that time. The clarification of the rela- 
tion between the two kinds of probability theories began much later, 
about twenty years ago, especially with G. Mackey's work based on 
some earlier research of von Neumann. At last a general and unified 
probability theory has developed, which involves both classical and 
quantum probability theory (cf. Gudder's book). This solved a contra- 
diction and made it possible to outline a probability theory based on 
general event structures. 


(Ref.: Born, M., Natural Philosophy of Cause and Chance, Dover Pub., New York, 
1964; 
Gudder, S. P., Stochastic Methods in Quantum Mechanics, North-Holland, 
New York, 1979.) 


g) The paradox of cryptography 


Throughout the several thousand years in the history of cryptography,
cryptographers have invented more and more cunning ciphers, and
their adversaries have correspondingly outwitted them by discovering
more and more efficient techniques for cracking ciphers. Edgar Allan Poe,
who fancied himself a skilled cryptanalyst, was convinced that “... human
ingenuity cannot concoct a cipher which human ingenuity cannot
resolve”. The first turning point in the history of cryptography was
reached in the twenties, when “one-time pads” were discovered. These
one-time ciphers were first used by the Germans and have been in general
use for half a century. Different types of one-time pads are considered
very efficient and are in constant use today in many countries, for special
messages. The famous “hot-line” between Washington and Moscow
also makes use of a one-time pad. These ciphers are really unbreakable
in principle, since a different shift cipher is used to encode each symbol
in the plaintext, each time choosing the shift at random. If the letter
“e” was always encoded as “t”, it would be a simple substitution
cipher, easily broken by a statistical analysis (as “e” is the most frequent
letter in many languages). If, however, “e” is encoded sometimes as “a”,
sometimes as “c” and sometimes as “w” and the substituting letter is
chosen at random, then this one-time cipher is uncrackable even in
principle. The ciphertext does not disclose anything about the message. The
disadvantage of this procedure is that each one-time pad can be used only
once, for a single message. A brilliant discovery of Whitfield Diffie and
Martin E. Hellman, both electrical engineers at Stanford University,
revolutionized the entire field of secret communication. Inspired by the
mathematical theory of complexity they proposed a new kind of cipher in
1975 which is not unbreakable in principle but is unbreakable in practice.
More precisely, these new ciphers can be broken, but only by computer
programs that run for millions of years. Surprisingly, the encryption and
decryption procedures of Diffie and Hellman are not symmetric, meaning
that if only the method of encryption is known, it is computationally
infeasible to discover the method of decoding, and this provides secrecy
in practice. (This method of ciphering is made possible by what Diffie and
Hellman call a trapdoor one-way function.) The secret can be locked and
unlocked with different keys (and opening it requires a much finer key).
The basic idea of one-way encipherment is very simple: two numbers can
easily be multiplied by each other, e.g., the product of 101 and 211 can
be calculated quickly, it is 21311; but if we want to find two integers
greater than one whose product is 21311, then it will take much more
time to find that 101 and 211 are the only possible solution. Naturally
there are computer algorithms for factoring numbers, but in the case of
a 40 to 50 digit number, the running time required would be millions of
years. On the basis of prime number theory, a simple trapdoor function
was found: the enciphering key depends only on the product of two
prime numbers, whereas to decipher the ciphertext the two prime numbers
have to be known, too. Let us go into more detail about this
trapdoor function!

Let p and q be two large, random prime numbers. The product n of
these two numbers and another random number E (which must have no common
factor with (p − 1)(q − 1)) are the user's enciphering key (E, n), which
does not have to be kept secret; it can be put in a public file, such as a
telephone directory. To apply the key, a sender first converts his message
into a string of numbers, which he then breaks into blocks
$B_1, B_2, \ldots$. Each plaintext number $B_i$ must be between 0 and
n − 1. The sender computes for each plaintext number $B_i$ the ciphertext
number $C_i = B_i^E$ modulo n (that is, if the Eth power of $B_i$ is
divided by n, the remainder is $C_i$). This public-key cryptosystem is
based on the fact that although finding large prime numbers (p and q) is
computationally easy, factoring the product of two such numbers is at
present computationally infeasible, so knowing only (E, n) and $C_i$, it is
hopelessly difficult to find $B_i$. To decipher a ciphertext
$C_1, C_2, \ldots$, the user employs n and a secret deciphering key D
derived from the prime factors p and q of n. D is the multiplicative
inverse of E modulo (p − 1)(q − 1), that is, ED modulo (p − 1)(q − 1) is
equal to 1. (The product (p − 1)(q − 1) is the number of integers between
1 and n that have no common factor with n.) After all this the receiver can
easily obtain $B_i$:


$$C_i^D = (B_i^E)^D = B_i$$


mod n. This method was designed by Rivest, Shamir and Adleman and
is called the RSA system.
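
The enciphering and deciphering steps can be tried out on a toy scale. The
sketch below reuses the small primes 101 and 211 from the factoring example
above; the exponent E = 11 and the sample blocks are arbitrary choices made
for illustration, and a real key would use primes hundreds of digits long.

```python
import math

p, q = 101, 211              # the small primes from the factoring example
n = p * q                    # public modulus, 21311
phi_n = (p - 1) * (q - 1)    # (p - 1)(q - 1) = 21000
E = 11                       # assumed public exponent, illustrative only
assert math.gcd(E, phi_n) == 1   # E must have no common factor with phi_n

# Secret key: the multiplicative inverse of E modulo (p - 1)(q - 1).
# The three-argument pow with exponent -1 needs Python 3.8 or later.
D = pow(E, -1, phi_n)


def encipher(block):
    """C = B^E mod n for a plaintext block 0 <= B <= n - 1."""
    return pow(block, E, n)


def decipher(cipher):
    """B = C^D mod n recovers the plaintext block."""
    return pow(cipher, D, n)


blocks = [1986, 42, 21310]                  # sample plaintext numbers
ciphertext = [encipher(b) for b in blocks]
recovered = [decipher(c) for c in ciphertext]
print(ciphertext, recovered == blocks)
```

Anyone may compute `encipher` from the public pair (E, n), while
`decipher` needs D, which in this sketch is derived from the factors p
and q.
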


(Ref.: Hellman, M. E., “The mathematics of public-key cryptography”, Sci. Amer., 
241, 130—139, (1979). 
Simmons, G. J., “Cryptology, the mathematics of secure communication”, The 
Math. Intelligencer, 1, 233—246, (1979). 
Shamir, A., “A polynomial time algorithm for breaking Merkle-Hellman
cryptosystems”, Research Announcement, 1982.)


h) The paradox of poetry and information theory 


The last paradox in this book is a quotation from my late professor 
Alfréd Rényi. 

“Since I started to deal with information theory I have often meditated 
upon the conciseness of poems; how can a single line of verse contain 
far more ‘information’ than a highly concise telegram of the same length. 
The surprising richness of meaning of literary works seems to be in 
contradiction with the laws of information theory. The key to this para- 
dox is, I think, the notion of ‘resonance’. The writer does not merely 
give us information, but also plays on the strings of the language with 
such virtuosity, that our mind, and even the subconscious self resonate. 
A poet can recall chains of ideas, emotions and memories with a well- 
turned word. In this sense, writing is magic.” 


Chapter 3 


Paradoxology 


“On foundations we believe in the reality
of mathematics, but of course
when philosophers attack us with
their paradoxes we rush to hide behind
formalism and say, 'Mathematics is
just a combination of meaningless
symbols,'...”


(J. A. Dieudonné, 1970) 


Like most branches of science, mathematics is also the history of para- 
doxes. The greatest discoveries generally solve the greatest paradoxes 
(think of Darwin or Einstein) while they serve as sources for new ones as 
well. Socrates’ teaching method of perceiving new ideas through para- 
doxes is the most fundamental because the process of scientific cognition 
itself rests on paradoxes. 

For the development of deductive mathematics it was of fundamental 
importance that (in spite of the Pythagorean “all is number”, i.e., integer 
number) there are distances (e.g., the diagonal and the side of a square) 
whose ratio is not the ratio of integer numbers, which means that this 
ratio is not a number in the Pythagorean sense. (In modern terminology, it is 
not a rational number.) This paradox of “incommensurability” led to 
the dissolution of the Pythagorean school and the overshadowing of 
number mysticism, to Euclidean geometry (where the role of numbers was 
taken over by geometric figures), and to Plato's “mathematical idealism” 
(in practice “incommensurability” cannot be tested directly, thus, according 
to Plato, experience cannot lead to real knowledge). The greatest 
paradox in the mathematics of the Middle Ages was that “nothing”, i.e., 
nought, should be considered something and denoted by a figure. Thanks 
to this Indian-Arabic method of writing numbers, calculating 
became much easier. At the dawn of modern times several paradoxes 
were caused by negative and, later, complex numbers. One of them 
states that $(-1):1 = 1:(-1)$ is impossible because the ratio of a smaller 
number to a greater one cannot be equal to that of a greater number to 
a smaller one. Modern times have brought several new paradoxes in all 
branches of mathematics, from the solvability of algebraic equations 
to Bolyai's geometry. It is interesting that already in the first half of the 
last century B. Bolzano from Prague devoted a whole book to the paradoxes 
of infinity (“Paradoxien des Unendlichen”), though the most interesting 
paradoxes of infinity appeared only after G. Cantor's set theory, 
published in 1872. Most leading mathematicians of the century, such as 
Gauss, Cauchy, Kronecker, Poincaré and others, rejected the notion of 
actual infinity and assigned only symbolical meaning to it. The founda- 
tion stone of modern mathematics is, however, Cantor's theory using 
the notion of actual infinity, though we have to emphasize that the 
“horror infiniti” has not yet vanished. In fact, new paradoxes have increased 
the number of finitists. Similarly, the fear of randomness is still in the 
air. The mathematical paradoxes of infinity and randomness are extremely 
important because these two concepts fundamentally influence our 
outlook and philosophical attitude. Probability theory has evolved as a 
symbolic counterpart of the random universe; it is therefore to be hoped that 
the paradoxes in this book will help the reader to find the best way 
through our random world. 

“Nothing is sure for me but the uncertain thing; 
Nothing obscure but what is wholly evident; 

I have no doubt, except in what is certain; 
Knowledge I hold to be a sudden accident;” 


(F. Villon, Ballade du concours de Blois, 11-14) 

Notations 


$P(A)$            probability of the event $A$ 
$P(\bar{A})$      probability of the complement of the event $A$ ($\bar{A}$) 
$P(AB)$           probability of the joint occurrence of events $A$ and $B$ 
$P(A|B)$          probability of the event $A$ given that the event $B$ has occurred 

$n! = 1 \cdot 2 \cdot 3 \cdots n$ 
$n(n-1)(n-2) \cdots (n-k+1)$ 
$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$ 

$\hat{\theta}$    estimator of the parameter $\theta$ 
$E(X)$ or $EX$    expectation (or expected value) of the random variable $X$ 
$D(X)$            standard deviation of the random variable $X$ 

$\Gamma(z) = \int_0^{\infty} e^{-t} t^{z-1}\,dt$ if the real part of the complex number $z$ is positive; 
$\Gamma(z+1) = z\Gamma(z)$, thus $\Gamma(n+1) = n!$; $\Gamma(1/2) = \sqrt{\pi}$ 

$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du$ 

Table 1. The standard normal distribution function 


$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du \qquad [\Phi(-x) = 1 - \Phi(x)]$ 


x | Φ(x) | x | Φ(x) | x | Φ(x) | x | Φ(x) | x | Φ(x) 


0,00 | 0,5000 | 0,51 | 0,6950 | 1,02 | 0,8461 | 1,53 | 0,9370 | 2,08 | 0,9812 
0,01 | 0,5040 | 0,52 | 0,6985 | 1,03 | 0,8485 | 1,54 | 0,9382 | 2,10 | 0,9821 
0,02 | 0,5080 | 0,53 | 0,7019 | 1,04 | 0,8508 | 1,55 | 0,9394 | 2,12 | 0,9830 
0,03 | 0,5120 | 0,54 | 0,7054 | 1,05 | 0,8531 | 1,56 | 0,9406 | 2,14 | 0,9838 
0,04 | 0,5160 | 0,55 | 0,7088 | 1,06 | 0,8554 | 1,57 | 0,9418 | 2,16 | 0,9846 
0,05 | 0,5199 | 0,56 | 0,7123 | 1,07 | 0,8577 | 1,58 | 0,9429 | 2,18 | 0,9854 
0,06 | 0,5239 | 0,57 | 0,7157 | 1,08 | 0,8599 | 1,59 | 0,9441 | 2,20 | 0,9861 
0,07 | 0,5279 | 0,58 | 0,7190 | 1,09 | 0,8621 | 1,60 | 0,9452 | 2,22 | 0,9868 
0,08 | 0,5319 | 0,59 | 0,7224 | 1,10 | 0,8643 | 1,61 | 0,9463 | 2,24 | 0,9875 
0,09 | 0,5359 | 0,60 | 0,7257 | 1,11 | 0,8665 | 1,62 | 0,9474 | 2,26 | 0,9881 
0,10 | 0,5398 | 0,61 | 0,7291 | 1,12 | 0,8686 | 1,63 | 0,9484 | 2,28 | 0,9887 
0,11 | 0,5438 | 0,62 | 0,7324 | 1,13 | 0,8707 | 1,64 | 0,9495 | 2,30 | 0,9893 
0,12 | 0,5478 | 0,63 | 0,7357 | 1,14 | 0,8729 | 1,65 | 0,9505 | 2,32 | 0,9898 
0,13 | 0,5517 | 0,64 | 0,7389 | 1,15 | 0,8749 | 1,66 | 0,9515 | 2,34 | 0,9904 
0,14 | 0,5557 | 0,65 | 0,7422 | 1,16 | 0,8770 | 1,67 | 0,9525 | 2,36 | 0,9909 
0,15 | 0,5596 | 0,66 | 0,7454 | 1,17 | 0,8790 | 1,68 | 0,9535 | 2,38 | 0,9913 
0,16 | 0,5636 | 0,67 | 0,7486 | 1,18 | 0,8810 | 1,69 | 0,9545 | 2,40 | 0,9918 
0,17 | 0,5675 | 0,68 | 0,7517 | 1,19 | 0,8830 | 1,70 | 0,9554 | 2,42 | 0,9922 
0,18 | 0,5714 | 0,69 | 0,7549 | 1,20 | 0,8849 | 1,71 | 0,9564 | 2,44 | 0,9927 
0,19 | 0,5753 | 0,70 | 0,7580 | 1,21 | 0,8869 | 1,72 | 0,9572 | 2,46 | 0,9931 
0,20 | 0,5793 | 0,71 | 0,7611 | 1,22 | 0,8888 | 1,73 | 0,9582 | 2,48 | 0,9934 
0,21 | 0,5832 | 0,72 | 0,7642 | 1,23 | 0,8907 | 1,74 | 0,9591 | 2,50 | 0,9938 
0,22 | 0,5871 | 0,73 | 0,7673 | 1,24 | 0,8925 | 1,75 | 0,9599 | 2,52 | 0,9941 
0,23 | 0,5910 | 0,74 | 0,7703 | 1,25 | 0,8944 | 1,76 | 0,9608 | 2,54 | 0,9945 
0,24 | 0,5948 | 0,75 | 0,7734 | 1,26 | 0,8962 | 1,77 | 0,9616 | 2,56 | 0,9948 
0,25 | 0,5987 | 0,76 | 0,7764 | 1,27 | 0,8980 | 1,78 | 0,9625 | 2,58 | 0,9951 
0,26 | 0,6026 | 0,77 | 0,7794 | 1,28 | 0,8997 | 1,79 | 0,9633 | 2,60 | 0,9953 
0,27 | 0,6064 | 0,78 | 0,7823 | 1,29 | 0,9015 | 1,80 | 0,9641 | 2,62 | 0,9956 
0,28 | 0,6103 | 0,79 | 0,7853 | 1,30 | 0,9032 | 1,81 | 0,9649 | 2,64 | 0,9959 
0,29 | 0,6141 | 0,80 | 0,7881 | 1,31 | 0,9049 | 1,82 | 0,9656 | 2,66 | 0,9961 
0,30 | 0,6179 | 0,81 | 0,7910 | 1,32 | 0,9066 | 1,83 | 0,9664 | 2,68 | 0,9963 
0,31 | 0,6217 | 0,82 | 0,7939 | 1,33 | 0,9082 | 1,84 | 0,9671 | 2,70 | 0,9965 
0,32 | 0,6255 | 0,83 | 0,7967 | 1,34 | 0,9099 | 1,85 | 0,9678 | 2,72 | 0,9967 
0,33 | 0,6293 | 0,84 | 0,7995 | 1,35 | 0,9115 | 1,86 | 0,9686 | 2,74 | 0,9969 
0,34 | 0,6331 | 0,85 | 0,8023 | 1,36 | 0,9131 | 1,87 | 0,9693 | 2,76 | 0,9971 
0,35 | 0,6368 | 0,86 | 0,8051 | 1,37 | 0,9147 | 1,88 | 0,9699 | 2,78 | 0,9973 
0,36 | 0,6406 | 0,87 | 0,8078 | 1,38 | 0,9162 | 1,89 | 0,9706 | 2,80 | 0,9974 
0,37 | 0,6443 | 0,88 | 0,8106 | 1,39 | 0,9177 | 1,90 | 0,9713 | 2,82 | 0,9976 
0,38 | 0,6480 | 0,89 | 0,8133 | 1,40 | 0,9192 | 1,91 | 0,9719 | 2,84 | 0,9977 
0,39 | 0,6517 | 0,90 | 0,8159 | 1,41 | 0,9207 | 1,92 | 0,9726 | 2,86 | 0,9979 
0,40 | 0,6554 | 0,91 | 0,8186 | 1,42 | 0,9222 | 1,93 | 0,9732 | 2,88 | 0,9980 
0,41 | 0,6591 | 0,92 | 0,8212 | 1,43 | 0,9236 | 1,94 | 0,9738 | 2,90 | 0,9981 
0,42 | 0,6628 | 0,93 | 0,8238 | 1,44 | 0,9251 | 1,95 | 0,9744 | 2,92 | 0,9982 
0,43 | 0,6664 | 0,94 | 0,8264 | 1,45 | 0,9265 | 1,96 | 0,9750 | 2,94 | 0,9984 
0,44 | 0,6700 | 0,95 | 0,8289 | 1,46 | 0,9279 | 1,97 | 0,9756 | 2,96 | 0,9985 
0,45 | 0,6736 | 0,96 | 0,8315 | 1,47 | 0,9292 | 1,98 | 0,9761 | 2,98 | 0,9986 
0,46 | 0,6772 | 0,97 | 0,8340 | 1,48 | 0,9306 | 1,99 | 0,9767 | 3,00 | 0,9987 
0,47 | 0,6808 | 0,98 | 0,8365 | 1,49 | 0,9319 | 2,00 | 0,9772 | 3,20 | 0,9993 
0,48 | 0,6844 | 0,99 | 0,8389 | 1,50 | 0,9332 | 2,02 | 0,9783 | 3,40 | 0,9996 
0,49 | 0,6879 | 1,00 | 0,8413 | 1,51 | 0,9345 | 2,04 | 0,9793 | 3,60 | 0,9998 
0,50 | 0,6915 | 1,01 | 0,8438 | 1,52 | 0,9357 | 2,06 | 0,9803 | 3,80 | 0,9999 
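
Entries of Table 1 can be recomputed from the error function, since Φ(x) = ½·(1 + erf(x/√2)). The sketch below assumes Python's standard `math.erf`; the function name `phi` is an illustrative choice, not notation from the book.

```python
import math


def phi(x: float) -> float:
    """Standard normal distribution function, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Spot-check a few tabulated values to four decimals
for x, tabulated in [(0.00, 0.5000), (1.00, 0.8413), (1.96, 0.9750), (3.00, 0.9987)]:
    assert round(phi(x), 4) == tabulated

# Symmetry noted in the table caption: phi(-x) = 1 - phi(x)
assert math.isclose(phi(-1.5), 1.0 - phi(1.5))
```

For negative arguments, which the table omits, the symmetry relation gives the value directly.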

Table 2. The first twenty thousand digits of π. 
Is the occurrence of the decimals really random, or do they obey a rule? 


π = 3, + 
1415926535 8979323846 2643383279 5028841971 6939937510 
8214808651 3282306647 0938446095 5058223172 5359408128 
4428810975 6659334461 2847564823 3786783165 2712019091 
7245870066 0631558817 4881520920 9628292540 9171536436 
3305727036 5759591953 0921861173 8193261179 3105118548 
9833673362 4406566430 8602139494 6395224737 1907021798 
0005681271 4526356082 7785771342 7577896091 7363717872 
4201995611 2129021960 8640344181 5981362977 4771309960 
5024459455 3469083026 4252230825 3344685035 2619311881 
5982534904 2875546873 1159562863 8823537875 9375195778 
3809525720 1065485863 2788659361 5338182796 8230301952 
5574857242 4541506959 5082953311 6861727855 8890750983 
8583616035 6370766010 4710181942 9555961989 4676783744 
9331367702 8989152104 7521620569 6602405803 8150193511 
6782354781 6360093417 2164121992 4586315030 2861829745 
3211653449 8720275596 0236480665 4991198818 3479775356 
8164706001 6145249192 1732172147 7235014144 1973568548 
4547762416 8625189835 6948556209 9219222184 2725502542 
8279679766 8145410095 3883786360 9506800642 2512520511 
0674427862 2039194945 0471237137 8696095636 4371917287 
9465764078 9512694683 9835259570 9825822620 5224894077 
4962524517 4939965143 1429809190 6592509372 2169646151 
6868386894 2774155991 8559252459 5395943104 9972524680 
4390451244 1365497627 8079771569 1435997700 1296160894 
0168427394 5226746767 8895252138 5225499546 6672782398 
1507606947 9451096596 0940252288 7971089314 5669136867 
9009714909 6759852613 6554978189 3129784821 6829989487 
5428584447 9526586782 1051141354 7357395231 1342716610 
0374200731 0578539062 1983874478 0847848968 3321445713 
8191197939 9520614196 6342875444 0643745123 7181921799 
5679452080 9514655022 5231603881 9301420937 6213785595 
0306803844 7734549202 6054146659 2520149744 2850732518 
1005508106 6587969981 6357473638 4052571459 1028970641 
2305587631 7635942187 3125147120 5329281918 2618612586 
9229109816 9091528017 3506712748 5832228718 3520935396 
6711136990 8658516398 3150197016 5151168517 1437657618 
8932261854 8963213293 3089857064 2046752590 7091548141 
2332609729 9712084433 5732654893 8239119325 9746366730 
1809377344 4030707469 2112019130 2033038019 7621101100 
2131449576 8572624334 4189303968 6426243410 7132269780 
6655730925 4711055785 3763466820 6531098965 2691862056 
3348850346 1136576867 5324944166 8039626579 7877185560 
7002378776 5913440171 2749470420 5622305389 9456131407 


6343285878 5698305235 8089330657 5740679545 7163775254 
0990796547 3761255176 5675135751 7829666454 7791754011 
9389713111 7904297828 5647503203 1986915140 2870808599 
8530614228 8137585043 0633217518 2979866223 7172159160 
9769265672 1463853067 3609657120 9180763832 7166416274 
6171196377 9213375751 1495950156 6049631862 9472654736 
6222247715 8915049530 9844489333 0963408780 7693259939 
5820974944 5923078164 0628620899 8628034825 3421170679 
4811174502 8410270193 8521105559 6446229489 5493038196 
4564856692 3460348610 4543266482 1339360726 0249141273 
7892590360 0113305305 4882046652 1384146951 9415116094 
0744623799 6274956735 1885752724 8912279381 8301194912 
6094370277 0539217176 2931767523 8467481846 7669405132 
1468440901 2249534301 4654958537 1050792279 6892589235 
5187072113 4999999837 2978049951 0597317328 1609631859 
7101000313 7838752886 5875332083 8142061717 7669147303 
1857780532 1712268066 1300192787 6611195909 2164201989 
0353018529 6899577362 2599413891 2497217752 8347913151 
8175463746 4939319255 0604009277 0167113900 9848824012 
9448255379 7747268471 0404753464 6208046684 2590694912 
2533824300 3558764024 7496473263 9141992726 0426992279 
5570674983 8505494588 5869269956 9092721079 7509302955 
6369807426 5425278625 5181841757 4672890977 7727938000 
1613611573 5255213347 5741849468 4385233239 0739414333 
5688767179 0494601653 4668049886 2723219178 6085784383 
7392984896 0841284886 2694560424 1965285022 2106611863 
4677646575 7396241389 0865832645 9958133904 7802759009 
2671947826 8482601476 9909026401 3639443745 5305068203 
5709858387 4105978859 5977297549 8930161753 9284681382 
8459872736 4469584865 3836736222 6260991246 0805124388 
4169486855 5848406353 4220722258 2848864815 8456028506 
6456596116 3548862305 7745649803 5593634568 1743241125 
2287489405 6010150330 8617928680 9208747609 1782493858 
2265880485 7564014270 4775551323 7964145152 3746234364 
2135969536 2314429524 8493718711 0145765403 5902799344 
8687519435 0643021845 3191048481 0053706146 8067491927 
9839101591 9561814675 1426912397 4894090718 6494231961 
6638937787 0830390697 9207734672 2182562599 6615014215 
6660021324 3408819071 0486331734 6496514539 0579626856 
4011097120 6280439039 7595156771 5770042033 7869936007 
7321579198 4148488291 6447060957 5270695722 0917567116 
5725121083 5791513698 8209144421 0067510334 6711031412 
3515565088 4909989859 9823873455 2833163550 7647918535 


6549859461 6371802709 8199430992 4488957571 2828905923 
5836041428 1388303203 8249037589 8524374417 0291327656 
4492932151 6084244485 9637669838 9522868478 3123552658 
2807318915 4411010446 8232527162 0105265227 2111660396 
4769312570 5863566201 8558100729 3606598764 8611791045 
8455296541 2665408530 6143444318 5867697514 5661406800 
1127000407 8547332699 3908145466 4645880797 2708266830 
2021149557 6158140025 0126228594 1302164715 5097925923 
2996148903 0463994713 2962107340 4375189573 5961458901 
0480109412 1472213179 4764777262 2414254854 5403321571 
7716692547 4873898665 4949450114 6540628433 6639379003 
8888007869 2560290228 4721040317 2118608204 1900042296 
4252308177 0367515906 7350235072 8354056704 0386743513 
7805419341 4473774418 4263129860 8099888687 4132604721 
5695162396 5864573021 6315981931 9516735381 2974167729 
4037014163 1496589794 0924323789 6907069779 4223625082 
5578297352 3344604281 5126272037 3431465319 7777416031 
3162499341 9131814809 2777710386 3877343177 2075456545 
3166636528 6193266863 3606273567 6303544776 2803504507 
9456127531 8134078330 3362542327 8394497538 2437205835 
0408591337 4641442822 7726346594 7047458784 7787201927 
8350493163 1284042512 1925651798 0694113528 0131470130 
9562586586 5570552690 4965209858 0338507224 2648293972 
4803048029 0058760758 2510474709 1643961362 6760449256 
2901618766 7952406163 4252257719 5429162991 9306455377 
2540790914 5135711136 9410919139 2351910760 2082520261 
2784768472 6860849003 3770242429 1651300500 5168323364 
1960121228 5993716231 3017114448 4640903890 6449544400 
2283345085 0486082503 9302133219 7155184306 3545500766 
4611996653 8581538420 5685338621 8672523340 2830871123 
2184564622 0134967151 8819097303 8119800497 3407239610 
9567302292 1913933918 5680344903 9820595510 0226353536 
2711172364 3435439478 2218185286 2608514006 6604433258 
0516553790 6866273337 9958511562 5784322988 2737231989 
4460477464 9159950549 7374256269 0104903778 1986835938 
3634655379 4986419270 5638728317 4872332083 7601123029 
0126901475 4668476535 7616477379 4675200490 7571555278 
2772190055 6148425551 8792530343 5139844253 2234157623 
2276930624 7435363256 9160781547 8181152843 6679570611 
3776700961 2071512491 4043027253 8607648236 3414334623 
9164219399 4907236234 6468441173 9403265918 4044378051 
1266830240 2929525220 1187267675 6220415420 5161841634 
9104140792 8862150784 2451670908 7000699282 1206604183 


9598470356 2226293486 0034158722 9805349896 5022629174 
5771028402 7998066365 8254889264 8802545661 0172967026 
5178609040 7086671149 6558343434 7693385781 7113864558 
6161528813 8437909904 2317473363 9480457593 1493140529 
9203767192 2033229094 3346768514 2214477379 3937517034 
8244625759 1633303910 7225383742 1821408835 0865739177 
3408005355 9849175417 3818839994 4697486762 6551658276 
4043523117 6006651012 4120065975 5851276178 5838292041 
7261507981 2554709589 0455635792 1221033346 6974992356 
7429958180 7247162591 6685451333 1239480494 7079119153 
9289647669 7583183271 3142517029 6923488962 7668440323 
9588970695 3653494060 3402166544 3755890045 6328822505 
6909411303 1509526179 3780029741 2076651479 3942590298 
6922210327 4889218654 3648022967 8070576561 5144632064 
2248261177 1858963814 0918390367 3672220888 3215137556 
2543709069 7939612257 1429894671 5435784687 8861444581 
5510500801 9086996033 0276347870 8108175450 1193071412 
1596131854 3475464955 6978103829 3097164651 4384070070 
3084076118 3013052793 2054274628 6540360367 4532865105 
5020141020 6723585020 0724522563 2651341055 9240190274 
2645600162 3742880210 9276457931 0657922955 2498872758 
4786724229 2465436680 0980676928 2382806899 6400482435 
2168895738 3798623001 5937764716 5122893578 6015881617 
9906655418 7639792933 4419521541 3418994854 4473456738 
3220777092 1201905166 0962804909 2636019759 8828161332 
7723554710 5859548702 7908143562 4014517180 6246436267 
3114771199 2606381334 6776879695 9703098339 1307710987 
7152807317 6790770715 7213444730 6057007334 9243693113 
4781643788 5185290928 5452011658 3934196562 1349142415 
8584783163 0577775606 8887644624 8246857926 0395352773 
2742042083 2085661190 6254543372 1315359584 5068772460 
9914037340 4328752628 8896399587 9475729174 6426357455 
8798531887 7058429725 9167781314 9699009019 2116971737 
3503895170 2989392233 4517220138 1280696501 1784408745 
6198690754 8516026327 5052983491 8740786680 8818338510 
8282949304 1377655279 3975175461 3953984683 3936383047 
2827892125 0771262946 3229563989 8989358211 6745627010 
3685406643 1939509790 1906996395 5255300545 0580685501 
1920419947 4553859381 0234395544 9597782779 0237421617 
8856986705 4315470696 5747458550 3323233421 0730154594 
8757141595 7811196358 3300594087 3068121602 8764962867 
1465741268 0492564879 8556145372 3478673303 9046883834 
9113679386 2708943879 9362016295 1541337142 4892830722 
1965362132 3926406160 1363581559 0742202020 3187277605 
3610642506 3904975008 6562710953 5919465897 5141310348 
0861533150 4452127473 9245449454 2368288606 1340841486 
5189757664 5216413767 9690314950 1910857598 4423919862 
3338945257 4239950829 6591228508 5558215725 0310712570 
8475651699 9811614101 0029960783 8690929160 3028840026 
7180653556 7252532567 5328612910 4248776182 5829765157 
8788202734 2092222453 3985626476 6914905562 8425039127 
6407655904 2909945681 5065265305 3718294127 0336931378 
7367812301 4587687126 6034891390 9562009939 3610310291 
7634757481 1935670911 0137751721 0080315590 2485309066 
4366199104 0337511173 5471918550 4644902636 5512816228 
1509682887 4782656995 9957449066 1758344137 5223970968 
5848358845 3142775687 9002909517 0283529716 3445621296 
9748442360 8007193045 7618932349 2292796501 9875187212 
3025494780 2490114195 2123828153 0911407907 3860251522 
2673430282 4418604142 6363954800 0448002670 4952482017 
2609275249 6035799646 9256504936 8183609003 2380929345 
4525564056 4482465151 8754711962 1844396582 5337543885 
9695946995 5657612186 5619673378 6236256125 2163208628 
9279068212 0738837781 4233562823 6089632080 6822246801 
0037279839 4004152970 0287830766 7094447456 0134556417 
2314593571 9849225284 7160504922 1242470141 2147805734 
2339086639 3833952942 5786905076 4310063835 1983438934 
7360411237 3599843452 2516105070 2705623526 6012764848 
7065874882 2596815793 6789766974 2205700596 8344086973 
2162484391 4035998953 5394590944 0704691209 1409387001 
4610126483 6999892256 9596881592 0560010165 5256375678 
5667227966 1988578279 4848855834 3975187445 4551296563 
7634180703 9476994159 7915945300 6975214829 3366555661 
4599041608 7532018683 7937023488 8689479151 0716378529 
1384987137 5704710178 7957310422 9690666702 1449863746 
1963403911 4732023380 7150952220 1068256342 7471646024 
0273900749 7297363549 6453328886 9844061196 4961627734 
4810688732 0685990754 0792342402 3009259007 0173196036 
4495397907 0903023460 4614709616 9688688501 4083470405 
2431974404 7718556789 3482308934 1068287027 2280973624 
0630792615 9599546262 4629707062 5948455690 3471197299 
3831591256 8989295196 4272875739 4691427253 4366941532 
0298865925 7866285612 4966552353 3829428785 4253404830 
0598135220 5117336585 6407826484 9427644113 7639386692 
6977631279 5722672655 5625962825 4276531830 0134070922 
7099558561 1349802524 9906698423 3017350358 0440811685 
2669178352 5870785951 2983441729 5351953788 5534573742 
1080485485 2635722825 7682034160 5048466277 5045003126 
2432221188 5159740547 0214828971 1177792376 1225788734 
2778944236 2167411918 6269439650 6715157795 8675648239 
8783419689 6861181558 1587360629 3860381017 1215855272 
6435495561 8689641122 8214075330 2655100424 1048967835 
3100735477 0549815968 0772009474 6961343609 2861484941 
0558642219 6483491512 6390128038 3200109773 8680662877 
8936793008 1699805365 2027600727 7496745840 0283624053 
6640075094 2608788573 5796037324 5141467867 0368809880 
7391253835 5915031003 3303251117 4915696917 4502714943 
2342548326 1119128009 2825256190 2052630163 9114772473 
1112635835 5387136101 1023267987 7564102468 2403226483 
9630878154 3221166912 2464159117 7673225326 4335686146 
8850280054 1436131462 3082102594 1737562389 9420751367 
4725470316 5661399199 9682628247 2706413362 2217892390 
7386480584 7268954624 3882343751 7885201439 5600571048 
3604103182 3507365027 7859089757 8272731305 0488939890 
9077271307 0686917092 6462548423 2407485503 6608013604 
5823626729 3264537382 1049387249 9669933942 4685516483 
5220935701 6263846485 2851490362 9320199199 6882851718 
5982311225 0628905854 9145097157 5539002439 3153519090 
7806101371 5004489917 2100222013 3501310601 6391541589 
0594741234 1933984202 1874564925 6443462392 5319531351 
9149419394 0608572486 3968836903 2655643642 1664425760 
8153740996 1545598798 2598910937 1712621828 3025848112 
2994972574 5303328389 6381843944 7107794022 8435988341 
1000162076 9563684677 6413017819 6593799715 5746854194 
6778113949 8616478747 1407932638 5873862473 2889645643 
2961273998 4134427260 8606187245 5452360643 1537101127 
5829486191 9667091895 8089833201 2103184303 4012849511 
7253558119 5840149180 9692533950 7577840006 7465526031 
0425845988 4199076112 8725805911 3935689601 4316682831 
7903688786 8789493054 6955703022 6190095020 7643349335 
1786227363 7169757741 8302398600 6591481616 4049449650 
4434803966 4205579829 3680435220 2770984294 2325330225 
5678736400 5366656416 5473217043 9035213295 4352916941 
0234529244 9773659495 6305100742 1087142613 4974595615 
4595280824 3694457897 7233004876 4765241339 0759204340 
3354400515 2126693249 3419672977 0415956837 5355516673 
4951827369 5588220757 3551766515 9895519098 6665393549 
2254756478 9406475483 4664776041 1463233905 6513433068 
4607429586 9913829668 2468185710 3188790652 8703665083 
8093996270 6074726455 3992539944 2808113736 9433887294 
6409089418 0595343932 5123623550 8134949004 3642785271 
3610045373 0488198551 7065941217 3524625895 4873016760 
8330701653 7228563559 1525347844 5981831341 1290019992 
4803118364 4536985891 7544264739 9882284621 8449008777 
3343657791 6012809317 9401718598 5999338492 3549564005 
5265311709 9570899427 3287092584 8789443646 0050410892 
6085902908 1765155780 3905946408 7350612322 6112009373 
2008007998 0492548534 6941469775 1649327095 0493463938 
7718819682 5462981268 6858170507 4027255026 3329044976 
9391760426 0176338704 5499017614 3641204692 1823707648 
6683008238 3404656475 8804051380 8016336388 7421637140 
2858829024 3670904887 1181900904 9453314421 8287661810 
7850171807 7930681085 4690009445 8995279424 3981392135 
9239718014 6134324457 2640097374 2570073592 1003154150 
4603726341 6554259027 6018348403 0681138185 5105979705 
6097164258 4975951380 6930944940 1515422221 9432913021 
3151558854 0392216409 7229101129 0355218157 6282328318 
3148573910 7115814425 3876117465 7867116941 4776421441 
4641766369 8066378576 8134920453 0224081972 7856471983 
1865452226 8126887268 4459684424 1610785401 6768142080 
2751674573 1891894562 8352570441 3354375857 5342698699 
3176085428 9437339356 1889165125 0424404008 9527198378 
1194988423 9060613695 7342315590 7967034614 9143447886 
0992391350 3373250855 9826558670 8924261242 9473670193 
6689511840 0936686095 4632500214 5852930950 0009071510 
2611341461 1068026744 6637334375 3407642940 2668297386 
3953669134 5222444708 0459239660 2817156551 5656661113 
2107119457 3002438801 7661503527 0862602537 8817975194 
5780371177 9277522597 8742891917 9155224171 8958536168 
0331147639 4911995072 8584306583 6193536932 9699289837 
7914710869 9843157337 4964883529 2796328220 7629472823 
3890119682 2142945766 7580718653 8065064870 2612289282 
0035838542 3897354243 9564755568 4095224844 5541392394 
6334893748 4391297423 9143365936 0410035234 3777065888 
5987746676 3847946650 4074111825 6583788784 5485814896 
4680977870 4464094758 2803487697 5894822824 1239292960 
6203534280 1441276172 8583024355 9830032042 0245120728 
4461670508 2768277222 3534191102 6341631571 4740612385 
7632356732 5417073420 8173322304 6298799280 4908514094 
9106024545 0864536289 3545686295 8531315337 1838682656 
1173213138 9574706208 8474802365 3710311508 9842799275 
4426853271 9743113951 4357417221 9759799359 6852522857 
4059909935 0500081337 5432454635 9675048442 3528487470 
8654489377 6979566517 2796623267 1481033864 3913751865 
3471606325 8306498297 9551010954 1836235030 3094530973 
8992161255 2559770143 6858943585 8775263796 2559708167 
7772370041 7808419423 9487254068 0155603599 8390548985 
1345308939 0620467843 8778505423 9390524731 3620129476 
7689631148 1090220972 4520759167 2970078505 8071718638 
5203802786 0990556900 1341371823 6837099194 9516489600 
4639138363 1857456981 4719620184 1080961884 6054560390 
3066988317 6833100113 3108690421 9390310801 4378433415 
6746964066 6531527035 3254671126 6722246055 1199581831 
3936307463 0569010801 1494271410 0939136913 8107258137 
0373572652 7922417373 6057511278 8721819084 4900617801 
6322439372 6562472776 0378908144 5883785501 9702843779 
6029841669 2254896497 1560698119 2186584926 7704039564 
1212415363 7451500563 5070127815 9267142413 4210330156 
9300806260 1809623815 1613669033 4111138653 8510919367 
7086806445 0969865488 0168287434 3786126543 8158342807 
9025100158 8827216474 5006820704 1937615845 4712318346 
2264482091 0235647752 7230820810 6351889915 2692889108 
5904211844 9499077899 9200732947 6805868577 8787209829 
6814671769 5976099421 0036183559 1387778176 9845875810 
1338413385 3684211978 9389001852 9569196780 4554482858 
5726949518 1795897546 9399264219 7915523385 7662316762 
6771432242 7680913236 5449485366 7680000010 6526248547 
5453076511 6803337322 2651756622 0752695179 1442252808 
2754251508 6765511785 9395002793 3895920576 6827896776 
8271937831 2654961745 9970567450 7183320650 3455664403 
7842645676 3388188075 6561216896 0504161139 0390639601 
6464411918 5682770045 7424343402 1672276445 5893301277 
0545886056 4550133203 5786454858 4032402987 1709348091 
0987399876 6973237665 7370158080 6822904599 2123661689 
3216142802 1497633991 8983548487 5625298752 4238730775 
2606163364 3296406335 7281070788 7581640438 1485018841 
9006664080 6314077757 7257056307 2940049294 0302420498 
2827105784 3197535417 9501134727 3625774080 2134768260 
8245026792 6594205550 3958792298 1852648007 0683765041 
1964965403 2187271602 6485930490 3978748958 9066127250 
3223810158 7444505286 6523802253 2843891375 2738458923 
3300538621 6347988509 4695472004 7952311201 5043293226 
2630971749 5072127248 4794781695 7296142365 8595782090 
7557971449 9246540386 8179921389 3469244741 9850973346 
8409146698 5692571507 4315740793 8053239252 3947755744 
2643744980 5596103307 9941453477 8457469999 2128599999 
6542616968 5867883726 0958774567 6182507275 9929508931 
8172932393 0106766386 8240401113 0402470073 5085782872 
1465242385 7762550047 4852964768 1479546700 7050347999 
1510102436 5553522306 9061294938 8599015734 6610237122 
0277722610 2544149221 5765045081 2067717357 1202718024 
4526379628 9612691572 3579866205 7340837576 6873884266 
1443545419 5762584735 6421619813 4073468541 1176688311 
9467300244 3450054499 5399742372 3287124948 3470604406 
3583446283 9476304775 6450150085 0757894954 8931393944 
7643800125 4365023714 1278346792 6101995585 2247172201 
7235467456 4239058585 0216719031 3952629445 5439131663 
9187497519 1011472315 2893267725 3391814660 7300089027 
1054967973 1001678708 5069420809 2232908070 3832634534 
7550493412 6787643674 6384902063 9640197666 8559232565 
3845534372 9141446513 4749407848 8442377217 5154334260 
1370924353 0136776310 8491351615 6422698475 0743032971 
9637637076 1799191920 3579582007 5956053023 4626775794 
8135789400 5599500183 5425118417 2136055727 5221035268 
3889710770 8229310027 9766593583 8758909395 6881485602 
3624078250 5270487581 6470324581 2908783952 3245323789 
8127810217 9912317416 3058105545 9880130048 4562997651 
6165356024 7338078430 2865525722 2753049998 8370153487 
3938352293 4588832255 0887064507 5394739520 4396807906 
5306184548 5903798217 9945996811 5441974253 6344399602 
0072629339 5505482395 5713725684 0232268213 0124767945 
4555711266 0396503439 7896278250 0161101532 3516051965 
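
One simple way to probe the question in the caption of Table 2 is a chi-square test of digit frequencies against the uniform distribution. The sketch below applies it only to the first 50 decimals (the first row of the table); the function names `digit_counts` and `chi_square_uniform` are illustrative and not taken from the book.

```python
from collections import Counter

# First 50 decimals of pi, copied from the first row of Table 2
FIRST_50 = "14159265358979323846264338327950288419716939937510"


def digit_counts(digits: str) -> list[int]:
    """Count how often each of the digits 0..9 occurs in the string."""
    counts = Counter(digits)
    return [counts.get(str(d), 0) for d in range(10)]


def chi_square_uniform(counts: list[int]) -> float:
    """Chi-square statistic against equal expected frequencies."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 for c in counts) / expected


counts = digit_counts(FIRST_50)
statistic = chi_square_uniform(counts)
print(counts)     # [2, 5, 5, 8, 4, 5, 4, 4, 5, 8]
print(statistic)  # 6.0
```

With 9 degrees of freedom the 5% critical value is about 16.92, so a statistic of 6.0 gives no reason to reject uniformity for this small sample. Passing a frequency test is only a necessary condition for randomness; it does not settle whether the digits obey a rule.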

Like most branches of science, mathematics is also the history of 
paradoxes. It is natural, therefore, that the mathematics of random 
phenomena has revealed many interesting paradoxes. Most 
monographs on probability theory and mathematical statistics 
contain only a few classical paradoxes. This book, however, presents 
a more complete collection, including some which have not been 
published before, and shows how the mathematical methods of 
random phenomena have developed from such paradoxes. The book 
also provides a summary of historical and philosophical backgrounds. 
The study and understanding of paradoxes leads to better intuition, 
particularly in the area of probability, and this volume will be of 
interest to those involved in the study of random events. 